What is the difference between the utf8mb4 and utf8 charsets in MySQL?

I already know about the ASCII, UTF-8, UTF-16 and UTF-32 encodings, but I'm curious what the difference is between the utf8mb4 group of encodings and the other encoding types defined in MySQL Server.

Are there any special benefits or purposes of using utf8mb4 rather than utf8?

Recommended answer

UTF-8 is a variable-length encoding: storing one code point requires one to four bytes. However, MySQL's encoding called "utf8" (an alias of "utf8mb3") stores a maximum of three bytes per code point.
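As a rough illustration of the one-to-four-byte rule, the sketch below uses Python's built-in UTF-8 codec to count the bytes needed for a few sample characters. It does not involve MySQL at all; the helper name `utf8_length` and the choice of sample characters are only for illustration.

```python
# Sample characters, written as escapes so the file itself stays ASCII.
SAMPLES = {
    "A": "U+0041 (ASCII)",
    "\u00e9": "U+00E9 (Latin small e with acute)",
    "\u4e2d": "U+4E2D (CJK ideograph, still inside the BMP)",
    "\U0001F600": "U+1F600 (emoji, outside the BMP)",
}


def utf8_length(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of the given text occupies."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch, label in SAMPLES.items():
        print(f"{label}: {utf8_length(ch)} byte(s)")
```

Only the last sample needs four bytes, which is exactly the case a three-byte-per-code-point encoding cannot represent.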

So the character set "utf8"/"utf8mb3" cannot store all Unicode code points: it only supports the range 0x0000 to 0xFFFF, which is called the "Basic Multilingual Plane" (BMP). See also Comparison of Unicode encodings.

This is what (a previous version of the same page of) the MySQL documentation has to say about it:

The character set named utf8[/utf8mb3] uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character and supports supplemental characters:

  • For a BMP character, utf8[/utf8mb3] and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.

  • For a supplementary character, utf8[/utf8mb3] cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8[/utf8mb3] cannot store the character at all, you do not have any supplementary characters in utf8[/utf8mb3] columns and you need not worry about converting characters or losing data when upgrading utf8[/utf8mb3] data from older versions of MySQL.

So if you want your column to support storing characters that lie outside the BMP (and you usually do), such as emoji, use "utf8mb4". See also What are the most common non-BMP Unicode characters in actual use?.
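To get a feel for whether existing text would need utf8mb4, one option is to scan it for code points above 0xFFFF before it reaches the database. The following is a minimal sketch under that assumption; the function names `non_bmp_chars` and `fits_in_utf8mb3` are hypothetical helpers, and the check only looks at the Unicode range, not at how any particular MySQL server or connection is configured.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def non_bmp_chars(text: str) -> list[str]:
    """Return the characters in text whose code point lies outside the BMP."""
    return [ch for ch in text if ord(ch) > BMP_MAX]


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character is a BMP character (at most 3 bytes in UTF-8)."""
    return not non_bmp_chars(text)


if __name__ == "__main__":
    for sample in ("plain ASCII", "caf\u00e9", "smile \U0001F600"):
        outside = [f"U+{ord(ch):04X}" for ch in non_bmp_chars(sample)]
        print(f"{sample!r}: fits={fits_in_utf8mb3(sample)} outside_bmp={outside}")
```

Text for which `fits_in_utf8mb3` returns False contains at least one supplementary character, which is the case the answer above says requires "utf8mb4".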
