Unicode in C++11 – Row Coding

Is the above analysis correct?

Let’s see.

you can’t validate an array of bytes as containing valid UTF-8

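Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, max_length) returns the number of leading bytes in the array that form valid UTF-8. If that number equals the size of the array, the whole array is valid.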
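Here is a minimal sketch of that check. The helper names valid_utf8_prefix and is_valid_utf8 are mine, made up for illustration. length() is a non-static member function, so the sketch creates a facet object and a conversion state first. Passing the byte count as the maximum is enough, because a string never holds more code points than bytes. Expect deprecation warnings when compiling as C++17 or later.

```cpp
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: number of leading bytes of `s` that the facet
// accepts as well-formed UTF-8.
std::size_t valid_utf8_prefix(const std::string& s) {
    std::codecvt_utf8<char32_t> cvt;   // this facet has a public destructor
    std::mbstate_t state{};            // fresh conversion state
    // The last argument caps the number of code points examined.
    int n = cvt.length(state, s.data(), s.data() + s.size(), s.size());
    return static_cast<std::size_t>(n);
}

// Hypothetical helper: true if every byte of `s` is part of valid UTF-8.
bool is_valid_utf8(const std::string& s) {
    return valid_utf8_prefix(s) == s.size();
}
```

How strict the check is in corner cases depends on your standard library implementation, so test it against the inputs you care about.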

you can’t find out the length

Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.
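A minimal sketch of the convert-and-measure approach, using std::wstring_convert. The function name count_code_points is hypothetical. Note that this counts Unicode code points, not user-perceived characters, and that it allocates a full UTF-32 copy just to read its size.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: counts the code points in a UTF-8 string by
// converting it to UTF-32 and taking the length of the result.
std::size_t count_code_points(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    // from_bytes throws std::range_error if the input cannot be converted.
    std::u32string utf32 = conv.from_bytes(utf8);
    return utf32.size();
}
```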

you can’t iterate over a std::string in any way other than byte-by-byte

Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, 1) returns the number of bytes occupied by the next UTF-8 “character” (Unicode code point), which gives you a way to iterate over them one at a time, and of course to determine their number (that’s not an “easy” way to count the number of characters, but it’s a way).
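A rough sketch of such an iteration. The function name split_code_points is made up for illustration, and stopping silently at the first invalid or truncated sequence is my assumption about what a caller would want; real code would probably report the error.

```cpp
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 string into the byte sequences of
// its individual code points, stopping at the first malformed sequence.
std::vector<std::string> split_code_points(const std::string& s) {
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t state{};
    std::vector<std::string> parts;
    const char* p = s.data();
    const char* end = p + s.size();
    while (p != end) {
        // Ask how many bytes make up the next single code point.
        int n = cvt.length(state, p, end, 1);
        if (n <= 0) break;  // invalid or truncated sequence
        parts.emplace_back(p, static_cast<std::size_t>(n));
        p += n;
    }
    return parts;
}
```

The size of the returned vector is the code point count, obtained without building a UTF-32 copy of the string.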

doesn’t really support UTF-16

Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. The result of a conversion to UTF-16 is, well, UTF-16. It is not restricted to the BMP: code points above U+FFFF come out as surrogate pairs.
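A small sketch of the round trip through std::wstring_convert. The function names utf8_to_utf16 and utf16_to_utf8 are hypothetical. Feeding it a code point outside the BMP, such as U+1F600, is a quick way to see that surrogate pairs are produced.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on bad input
}

// Hypothetical helper: UTF-16 code units back to UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);   // throws std::range_error on bad input
}
```

For the four UTF-8 bytes F0 9F 98 80 (U+1F600), utf8_to_utf16 yields two code units, 0xD83D and 0xDE00, and utf16_to_utf8 turns them back into the original four bytes.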

Demo that illustrates these points.

If I have missed some other “you can’t”, please point it out and I will address it.

Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All the things enumerated in the original question, once again, cannot (safely) be done using only the standard library.
