surrogate-pairs – Row Coding

JavaScript strings outside of the BMP

September 16, 2023 by Tarik

Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can. But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, …
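The code-unit point can be sketched in a few lines. Python is used here purely for illustration, and the helper name utf16_code_units is hypothetical. It counts how many UTF-16 code units a string would occupy, which is the quantity a JS string's length reflects.

```python
def utf16_code_units(text: str) -> int:
    """Return the number of UTF-16 code units needed to encode text."""
    # "utf-16-le" writes no byte order mark, and every code unit is two bytes.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u00e9"         # U+00E9, inside the BMP
astral_char = "\U0001F602"  # U+1F602, outside the BMP

# Python counts code points, so both strings have length 1 ...
print(len(bmp_char), len(astral_char))
# ... but the non-BMP character needs two UTF-16 code units (a surrogate pair).
print(utf16_code_units(bmp_char), utf16_code_units(astral_char))
```

A string API that indexes by code unit would therefore see the second string as two items, neither of which is a complete character on its own.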

How to use unicode in Android resource?

August 31, 2023 by Tarik

Your character (U+1F4E1) is outside of the Unicode BMP (Basic Multilingual Plane, the range from U+0000 to U+FFFF). Unfortunately, Android has very weak (if any) support for non-BMP characters. The UTF-8 representation of a non-BMP character requires 4 bytes (here 0xF0 0x9F 0x93 0xA1), but Android’s UTF-8 parser only understands sequences of up to 3 bytes (see it here and here). It …
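The byte count quoted above is easy to check. The following is a small illustrative sketch in Python (not Android code); the variable names are made up for the example.

```python
dish = "\U0001F4E1"  # U+1F4E1, the character from the question

encoded = dish.encode("utf-8")

# Show each byte of the UTF-8 encoding as two hex digits.
print(" ".join(f"{byte:02X}" for byte in encoded))  # F0 9F 93 A1
print(len(encoded))                                  # 4

# For comparison, the last BMP code point still fits in three bytes.
print(len("\uFFFF".encode("utf-8")))                 # 3
```

Every code point above U+FFFF takes four bytes in standard UTF-8, so a decoder limited to three-byte sequences cannot represent any of them.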

How can I convert surrogate pairs to normal string in Python?

July 16, 2023 by Tarik

You’ve mixed up a literal string \ud83d in a JSON file on disk (six characters: \ u d 8 3 d) and a single character u'\ud83d' (specified using a string literal in Python source code) in memory. It is the difference between len(r'\ud83d') == 6 and len('\ud83d') == 1 on Python 3. If you see '\ud83d\ude4f' …
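The distinction can be tried out directly. The sketch below also shows two commonly used ways to end up with a single non-BMP character; these are assumptions about what a fix might look like, not necessarily what the rest of the answer recommends.

```python
import json

# Six characters, as they would sit in the file: backslash, u, d, 8, 3, d.
on_disk = r"\ud83d"
# One lone surrogate code point held in memory.
in_memory = "\ud83d"
print(len(on_disk), len(in_memory))  # 6 1

# A JSON parser combines an escaped surrogate pair into one character.
parsed = json.loads(r'"\ud83d\ude4f"')
print(len(parsed), hex(ord(parsed)))  # 1 0x1f64f

# If the two surrogates are already separate code points in a str,
# a round trip through UTF-16 with "surrogatepass" merges them.
broken = "\ud83d\ude4f"
fixed = broken.encode("utf-16", "surrogatepass").decode("utf-16")
print(len(broken), len(fixed))  # 2 1
```

The "surrogatepass" error handler is needed on the encode step because lone surrogates are otherwise rejected by Python's UTF-16 codec.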

What are the most common non-BMP Unicode characters in actual use? [closed]

November 6, 2022 by Tarik

Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter’s public stream. It occurs more frequently than the tilde!

What is a “surrogate pair” in Java?

October 12, 2022 by Tarik

The term “surrogate pair” refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF. Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. …
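As a rough illustration of how a code point above 0xFFFF is split into two 16-bit code units, here is a minimal sketch of the standard UTF-16 arithmetic. It is written in Python rather than Java, and the function names to_surrogate_pair and from_surrogate_pair are hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above 0xFFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F602)
print(f"{high:04X} {low:04X}")               # D83D DE02
print(hex(from_surrogate_pair(high, low)))   # 0x1f602
```

High surrogates fall in 0xD800 to 0xDBFF and low surrogates in 0xDC00 to 0xDFFF, so a decoder can tell from a single code unit which half of a pair it is looking at.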