Is the above analysis correct
Let’s see.
you can’t validate an array of bytes as containing valid UTF-8
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, max_lenght)
returns the number of valid bytes in the array.
you can’t find out the length
Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that need to count characters (in any sense) arises rather infrequently.
you can’t iterate over a std::string in any way other than byte-by-byte
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, 1)
gives you a possibility to iterate over UTF-8 “characters” (Unicode code units), and of course determine their number (that’s not an “easy” way to count the number of characters, but it’s a way).
doesn’t really support UTF-16
Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>
. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.
Demo that illustrates these points.
If I have missed some other “you can’t”, please point it out and I will address it.
Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All these things enumerated in original question now cannot (safely) be done again, using only the standard library.