What’s the difference between utf8_unicode_ci and utf8mb4_0900_ai_ci?

The encoding is the same. That is, any character both can store is represented by the same bytes.
The character set is different. utf8mb4 has more characters.
The collation (how comparisons are done) is different.
The performance is different, but it rarely matters.

utf8_unicode_ci implies the CHARACTER SET utf8, which includes only the 1-, 2-, and 3-byte UTF-8 characters. Hence it excludes most Emoji and some Chinese characters.
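To make the 3-byte limit concrete, here is a small Python sketch (not MySQL code) that counts the UTF-8 bytes of a few characters. The helper names utf8_len and fits_utf8mb3 are made up for this illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 UTF-8 bytes.

    Hypothetical helper: it mimics the limit of the 3-byte utf8
    character set; it does not ask MySQL anything.
    """
    return all(utf8_len(ch) <= 3 for ch in text)


samples = [
    "a",           # U+0061, 1 byte
    "\u00e9",      # U+00E9 (e with acute), 2 bytes
    "\u20ac",      # U+20AC (euro sign), 3 bytes
    "\U0001F600",  # U+1F600 (an Emoji), 4 bytes
    "\U00020000",  # U+20000 (a CJK Extension B ideograph), 4 bytes
]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), fits 3-byte utf8: {fits_utf8mb3(ch)}")
```

The last two samples need 4 bytes each, so they can only be stored in a utf8mb4 column.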

utf8mb4_unicode_ci is the corresponding COLLATION for the 4-byte CHARACTER SET utf8mb4.

The Unicode Consortium has been evolving the specification over the years. Here are the mappings from its versions to MySQL collations:

4.0   _unicode_
5.2.0 _unicode_520_ (Unicode 2009; MySQL GA 5.6 2013)
9.0   _0900_
14.0  _uca1400_ai_ci etc.  as/ai and cs/ci (MariaDB-10.10, not MySQL)
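As a rough sketch, the mapping above can be turned into a lookup on the collation name. The function uca_version is a hypothetical helper written for this note; it only inspects the name and knows only the four rows listed above.

```python
from typing import Optional


def uca_version(collation: str) -> Optional[str]:
    """Guess the Unicode version from a collation name.

    Covers only the four rows of the table above; returns None for
    anything else (for example _bin or _general_ci collations).
    """
    tokens = collation.lower().split("_")
    if "unicode" in tokens:
        i = tokens.index("unicode")
        # _unicode_520_ is a separate, newer family than plain _unicode_
        if i + 1 < len(tokens) and tokens[i + 1] == "520":
            return "5.2.0"
        return "4.0"
    if "0900" in tokens:
        return "9.0"
    if "uca1400" in tokens:
        return "14.0"
    return None


for name in ["utf8_unicode_ci", "utf8mb4_unicode_520_ci",
             "utf8mb4_0900_ai_ci", "utf8mb4_uca1400_ai_ci"]:
    print(name, "->", uca_version(name))
```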

Most of the differences will be in areas that most people never encounter. One example: At some point, a change allowed Emoji to be distinguished and ordered in some manner.

The suffix (MySQL doc):

_bin      -- just compare the bits; don't consider case folding, accents, etc
_ci       -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci    -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)
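The suffixes can be imitated in Python to get a feel for what each one treats as equal. This is only an analogy built on the standard unicodedata module, not the algorithm MySQL uses, and the mode names "bin", "as_ci" and "ai_ci" are labels chosen for this sketch.

```python
import unicodedata


def strip_accents(s: str) -> str:
    """Decompose characters and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def compare_key(s: str, mode: str):
    """Build a comparison key that roughly imitates a collation suffix."""
    if mode == "bin":
        return s.encode("utf-8")                           # raw bytes only
    if mode == "as_ci":
        return unicodedata.normalize("NFC", s).casefold()  # fold case, keep accents
    if mode == "ai_ci":
        return strip_accents(s).casefold()                 # fold case, drop accents
    raise ValueError(f"unknown mode: {mode}")


def equal(a: str, b: str, mode: str) -> bool:
    return compare_key(a, mode) == compare_key(b, mode)


print(equal("A", "a", "bin"))         # False: the bytes differ
print(equal("A", "a", "ai_ci"))       # True: case is ignored
print(equal("a", "\u00e1", "as_ci"))  # False: the accent matters
print(equal("a", "\u00e1", "ai_ci"))  # True: the accent is ignored
```

A real collation also defines an ordering, not just equality, so treat this as a picture of the equality rules only.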

Performance:

_bin         -- simple, fast
_general_ci  -- compares one character at a time, so it cannot handle multi-letter equivalences (e.g., ss=ß); somewhat fast
...          -- slower
_0900_       -- (MySQL 8.0) much faster because of a rewrite

However, the speed of collation is usually the least of the performance issues in queries. INDEXes, JOINs, subqueries, table scans, etc. are much more critical to performance.
