Is C++20 ‘char8_t’ the same as our old ‘char’?

Disclaimer: I’m the author of the char8_t P0482 and P1423 proposals.

In C++20, char8_t is a distinct type from all other types. In the related proposal for C, N2653, char8_t is a typedef of unsigned char similar to the existing typedefs for char16_t and char32_t.

In C++20, char8_t has an underlying representation that matches unsigned char. It therefore has the same size (at least 8 bits, but possibly larger), alignment, and integer conversion rank as unsigned char, but it has different aliasing rules.

In particular, char8_t was not added to the list of types at [basic.lval]p11, [basic.life]p6.4, [basic.types]p2, or [basic.types]p4. This means that, unlike unsigned char, it cannot be used for the underlying storage of objects of another type, nor can it be used to examine the underlying representation of objects of other types; in other words, it cannot be used to alias other types. A consequence of this is that objects of type char8_t can be accessed via pointers to char or unsigned char, but pointers to char8_t cannot be used to access char or unsigned char data. For example:

reinterpret_cast<const char   *>(u8"text"); // OK: char may alias char8_t.
reinterpret_cast<const char8_t*>("text");   // Undefined behavior once the char data is accessed through the result.

The motivation for a distinct type with these properties is:

  1. To provide a distinct type for UTF-8 character data vs character data with an encoding that is either locale dependent or that requires separate specification.

  2. To enable overloading for ordinary string literals vs UTF-8 string literals (since they may have different encodings).

  3. To ensure an unsigned type for UTF-8 data (whether char is signed or unsigned is implementation defined).

  4. To enable better performance via a non-aliasing type; optimizers can better optimize types that do not alias other types.
