Would using the ASCII/Latin character set speed up the database?

2022-01-15 00:00:00 mariadb mysql utf8mb4 character-set

It would seem that using the ASCII charset for most fields, and specifying utf8 only for the fields that need it, would reduce the amount of I/O the database must perform by 100%.

Anyone know if this is true?

Update: The above was not really my question. I should have said: use Latin as the default character set and then specify utf8mb4 only for the fields that need it. The thinking is that using 1 byte instead of 2 bytes should improve I/O by 100%. Sorry for the confusion.

Solution

@RickJames is right, you should not worry about saving space by choosing ASCII or utf8 over utf8mb4.

utf8 and utf8mb4 are variable-length character encodings. The table in Wikipedia's UTF-8 article illustrates how characters automatically take 1, 2, 3, or 4 bytes each, depending on the value encoded. The leading bits of a character's first byte indicate how many bytes the character occupies, up to 4 bytes.

The Wikipedia article explains it clearly:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

You don't have to do anything to choose single-byte versus multi-byte mode. This is just the way the encoding works. Each character automatically uses the number of bytes it needs, and no more.
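As a rough illustration of the point above, here is a minimal Python sketch that measures how many bytes individual characters take in UTF-8. It uses Python's built-in "utf-8" codec as a stand-in for the encoding itself; the helper name `utf8_len` and the sample characters are only for illustration, and nothing here queries MySQL.

```python
# Sample characters, written as escapes so the code points are unambiguous.
samples = {
    "A": "US-ASCII letter",
    "\u00e9": "Latin letter with a diacritic (e acute)",
    "\u4e2d": "common CJK character",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

for ch, description in samples.items():
    # Print the code point rather than the glyph to avoid console encoding issues.
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s) ({description})")
```

Each character gets only the bytes it needs, without any per-column or per-string mode switch.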

So there is no advantage to using utf8 over utf8mb4, and no advantage to using ASCII over either, unless you need to restrict the characters allowed in a string.
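To make that concrete, the sketch below compares the encoded size of some ASCII-only values under Python's "ascii" and "utf-8" codecs. It assumes these codecs are a fair approximation of the character bytes a database would store for variable-length string columns; it ignores row overhead, length prefixes, and any engine-specific behavior. The sample values and the helper `encoded_size` are hypothetical.

```python
def encoded_size(values, codec):
    """Total number of bytes needed to encode all values with the given codec."""
    return sum(len(v.encode(codec)) for v in values)

# Hypothetical column contents that happen to contain only ASCII characters.
ascii_only = ["alice", "bob@example.com", "2022-01-15"]

size_ascii = encoded_size(ascii_only, "ascii")
size_utf8 = encoded_size(ascii_only, "utf-8")

# For ASCII-only data, both encodings produce byte-for-byte identical output.
print(size_ascii, size_utf8, size_ascii == size_utf8)
```

Under these assumptions, declaring the column as ASCII would not reduce the bytes stored for this kind of data.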

For what it's worth, the character set MySQL calls "utf8" is an alias for utf8mb3, an implementation limited to UTF-8 sequences of at most three bytes per character. The MySQL server team blog (https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/) says that utf8mb4 is faster, at least given the performance improvements in MySQL 8.0, and that utf8mb3 should be considered deprecated. The MySQL 8.0.11 release notes say that utf8 will be redefined as an alias for utf8mb4 in some future version of MySQL.
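One practical consequence of the three-byte limit is that utf8mb3 cannot represent characters outside the Basic Multilingual Plane. The following sketch checks a string for such characters before it is written anywhere. The function name `fits_in_utf8mb3` is hypothetical, and the check only looks at code points; it does not talk to MySQL.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character is in the Basic Multilingual Plane,
    that is, encodable in at most three UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)

# Latin and common CJK characters stay within three bytes.
print(fits_in_utf8mb3("h\u00e9llo \u4e2d"))
# An emoji (U+1F600) needs four bytes, so it needs utf8mb4.
print(fits_in_utf8mb3("hi \U0001F600"))
```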
