JavaScript strings outside of the BMP

It depends on what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can. However, each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, …
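To make the code-unit point concrete, here is a minimal sketch in Python (not JavaScript) that reproduces the count a JS string's length would report. The helper name utf16_code_units is made up for this illustration; it relies only on the fact that every UTF-16 code unit is two bytes.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, i.e. the unit a JS string is indexed by."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids adding a BOM.
    return len(text.encode("utf-16-le")) // 2


satellite = "\U0001F4E1"  # U+1F4E1, a character outside the BMP

code_points = len(satellite)              # Python 3 counts code points
code_units = utf16_code_units(satellite)  # JS-style count of code units

print(code_points, code_units)
```

A BMP-only string gives the same number both ways; the two counts only diverge once a character needs a surrogate pair.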

How to use unicode in Android resource?

Your character (U+1F4E1) is outside the Unicode BMP (Basic Multilingual Plane – range from U+0000 to U+FFFF). Unfortunately, Android has very weak (if any) support for non-BMP characters. The UTF-8 representation of a non-BMP character requires 4 bytes (0xF0 0x9F 0x93 0xA1 for this one), but Android's UTF-8 parser only understands sequences of up to 3 bytes (see it here and here). It …
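The byte counts above are easy to check. The following Python sketch encodes U+1F4E1 as UTF-8 and flags strings that would need a 4-byte sequence. The function fits_three_byte_parser is a hypothetical helper written for this illustration; it is not an Android API and only models the 3-byte limit described in the answer.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_parser(text: str) -> bool:
    """Hypothetical check: True if no character needs more than 3 UTF-8 bytes."""
    return all(utf8_length(ch) <= 3 for ch in text)


antenna = "\U0001F4E1"             # the character from the question
encoded = antenna.encode("utf-8")  # its 4-byte UTF-8 form

print(encoded.hex(" "))
print(fits_three_byte_parser(antenna))
```

Every BMP character fits in at most 3 UTF-8 bytes, so a check like this is effectively a test for non-BMP characters.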

How can I convert surrogate pairs to normal string in Python?

You've mixed up a literal string \ud83d in a JSON file on disk (six characters: \ u d 8 3 d) and a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3. If you see '\ud83d\ude4f' …
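A small sketch of both situations, using only the standard library. It assumes the escaped text sits inside a JSON string, in which case json.loads combines the two escapes into one character. For a str that already holds two separate surrogates in memory, a round trip through UTF-16 with the surrogatepass error handler is one way to merge them; treat it as an illustration, not necessarily what the full answer recommends.

```python
import json

# Case 1: escaped text as it appears in a JSON file on disk.
on_disk = r'"\ud83d\ude4f"'    # 14 characters including the quotes
decoded = json.loads(on_disk)  # the JSON decoder joins the surrogate pair

# Case 2: a str that already contains two separate surrogate code points.
in_memory = "\ud83d\ude4f"     # len(in_memory) == 2
repaired = in_memory.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(on_disk), len(decoded), len(in_memory), len(repaired))
```

In both cases the result is the single code point U+1F64F.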

What is a “surrogate pair” in Java?

The term “surrogate pair” refers to a means of encoding Unicode characters with code points above 0xFFFF in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF. Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. …
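The arithmetic behind a surrogate pair is short enough to write out. This is a sketch in Python rather than Java, and the function names to_surrogate_pair and from_surrogate_pair are invented for the example. It follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000     # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F4E1)
print(hex(high), hex(low))
```

High surrogates fall in 0xD800 to 0xDBFF and low surrogates in 0xDC00 to 0xDFFF, which is why a lone value in either range is not a complete character.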