What is the difference between utf8_general_ci and utf8_unicode_ci?

2021-11-20 00:00:00 unicode utf-8 mysql collation character-set

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?

Recommended answer

For people still arriving at this question in 2020 or later: there are newer options that may be better than both of these, for example utf8mb4_0900_ai_ci.

All of these collations are for the UTF-8 character encoding. They differ in how text is sorted and compared.

_unicode_ci and _general_ci are two different sets of rules for sorting and comparing text in the way we expect. Newer versions of MySQL introduce further rule sets, such as _0900_ai_ci, which provides equivalent rules based on Unicode 9.0 and has no corresponding _general_ci variant. People reading this now should probably use one of these newer collations instead of either _unicode_ci or _general_ci. The description of the older collations below is provided for interest only.

MySQL is currently transitioning away from an older, flawed UTF-8 implementation. For now, you need to use utf8mb4 rather than utf8 as the character encoding part of the name to be sure you get the fixed version. The flawed version remains for backward compatibility, though it is being deprecated.
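One practical consequence of the older implementation is its per-character size limit. This is a well-known limitation rather than something stated above: legacy utf8 (an alias for utf8mb3) stores at most three bytes per character, while utf8mb4 stores up to four. The following is a minimal Python sketch, not MySQL code, that checks whether a string would fit within that three-byte limit. The helper name fits_in_legacy_utf8 is made up for illustration.

```python
def fits_in_legacy_utf8(text: str) -> bool:
    """Return True if every character encodes to at most 3 UTF-8 bytes.

    Assumption: this models only the 3-byte-per-character limit of the
    legacy `utf8` (utf8mb3) character set; it does not talk to MySQL.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


if __name__ == "__main__":
    # Characters outside the Basic Multilingual Plane (such as most emoji)
    # need 4 bytes in UTF-8 and therefore need utf8mb4.
    for sample in ["Straße", "Œuvre", "日本語", "😀"]:
        print(sample, fits_in_legacy_utf8(sample))
```

A check like this can be useful when auditing application data before deciding whether existing columns need converting.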

Key differences

utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, and sorts accurately in a wide range of languages.

utf8mb4_general_ci is a simplified set of sorting rules that aims to do as well as it can while taking many shortcuts designed to improve speed. It does not follow the Unicode rules and will produce undesirable sorting or comparison results in some situations, such as with particular languages or characters.

On modern servers, this performance boost is all but negligible. It was devised at a time when servers had a tiny fraction of the CPU performance of today's computers.

Benefits of utf8mb4_unicode_ci over utf8mb4_general_ci

utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm to sort correctly across a wide range of languages and special characters. These rules need to take language-specific conventions into account; not everybody sorts their characters in what we would call "alphabetical order".

As far as Latin (i.e. "European") languages go, there is not much difference between Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences:

For example, the Unicode collation sorts "ß" like "ss" and "Œ" like "OE", as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).
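The expansion behavior described above can be illustrated with a toy model. This is only a sketch in Python and not how MySQL implements either collation: the two tables are hypothetical, contain only the two characters mentioned here, and the "general" entries simply follow this answer's own guess about "s" and "e".

```python
# Toy substitution tables. These are illustrative assumptions, not MySQL's
# real weight tables; the "general" entries follow the answer's own guess.
UNICODE_LIKE = {"ß": "ss", "œ": "oe"}  # one character may expand to two
GENERAL_LIKE = {"ß": "s", "œ": "e"}    # always one character to one


def toy_key(text: str, table: dict) -> str:
    """Build a case-insensitive comparison key using the given table."""
    return "".join(table.get(ch, ch) for ch in text.lower())


def toy_equal(a: str, b: str, table: dict) -> bool:
    """Compare two strings by their toy keys."""
    return toy_key(a, table) == toy_key(b, table)


if __name__ == "__main__":
    print(toy_equal("Straße", "STRASSE", UNICODE_LIKE))  # expansion model
    print(toy_equal("Straße", "STRASSE", GENERAL_LIKE))  # single-char model
    print(sorted(["Œuvre", "Zebra", "Ocean"], key=lambda s: toy_key(s, UNICODE_LIKE)))
```

In this model the expanding table treats "Straße" and "STRASSE" as the same word, while the one-to-one table instead matches "Straße" with "Strase".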

Some Unicode characters are defined as ignorable, which means they should not count toward the sort order and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.
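To make the idea of ignorable characters concrete, here is a rough Python sketch. It assumes, purely for illustration, that "ignorable" can be approximated by the Unicode general category Cf (format characters such as the soft hyphen and zero-width joiner); the real set is defined by the collation tables, and the function name is hypothetical.

```python
import unicodedata


def toy_key_ignoring(text: str) -> str:
    """Lower-case the text and drop format (Cf) characters.

    Assumption: category Cf stands in for the real list of ignorable
    characters, which comes from the collation's weight tables.
    """
    return "".join(
        ch for ch in text.lower() if unicodedata.category(ch) != "Cf"
    )


if __name__ == "__main__":
    with_soft_hyphen = "co\u00adoperate"  # contains U+00AD SOFT HYPHEN
    plain = "cooperate"
    print(with_soft_hyphen == plain)  # raw code points differ
    print(toy_key_ignoring(with_soft_hyphen) == toy_key_ignoring(plain))
```

Comparing by key rather than by raw code points is what lets the invisible character drop out of the comparison.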

In non-Latin languages, such as Asian languages or languages with different alphabets, there may be many more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci depends heavily on the language used. For some languages, it will be quite inadequate.

What should you use?

There is almost certainly no reason to use utf8mb4_general_ci anymore, as we have left behind the point where CPU speed was low enough for the performance difference to matter. Your database will almost certainly be limited by bottlenecks other than this one.

In the past, some people recommended using utf8mb4_general_ci except when accurate sorting was important enough to justify the performance cost. Today, that performance cost has all but disappeared, and developers are treating internationalization more seriously.

There's an argument to be made that if speed is more important to you than accuracy, you may as well not do any sorting at all. It's trivial to make an algorithm faster if you do not need it to be accurate. So utf8mb4_general_ci is a compromise that is probably not needed for speed reasons and probably not suitable for accuracy reasons either.

One other thing I'll add is that even if you know your application only supports the English language, it may still need to deal with people's names, which often contain characters used in other languages, and those are just as important to sort correctly. Using the Unicode rules for everything adds peace of mind, since the very smart Unicode people have worked very hard to make sorting work properly.

What the parts mean

Firstly, ci is for case-insensitive sorting and comparison. This means it is suitable for textual data where case is not important. The other types of collation are cs (case-sensitive), for textual data where case is important, and bin, for where the encoding needs to match bit for bit, which is suitable for fields that are really encoded binary data (including, for example, Base64). Case-sensitive sorting leads to some weird results, and case-sensitive comparison can result in duplicate values that differ only in letter case, so case-sensitive collations are falling out of favor for textual data. If case is significant to you, then otherwise ignorable punctuation and so on is probably also significant, and a binary collation might be more appropriate.
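The three comparison styles can be contrasted with a small Python model. This is a simplification under stated assumptions, not MySQL's behavior: the function name equal_under is hypothetical, ci is modeled as lower-casing, and the NFC normalization step is an assumption standing in for whatever text-level equivalence a real case-sensitive text collation applies, so that only bin compares raw bytes.

```python
import unicodedata


def equal_under(a: str, b: str, mode: str) -> bool:
    """Toy comparison under 'ci', 'cs', or 'bin' semantics."""
    if mode == "bin":
        # Binary: the encoded bytes must match exactly.
        return a.encode("utf-8") == b.encode("utf-8")
    # Assumption: text collations treat canonically equivalent forms alike,
    # modeled here with NFC normalization.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if mode == "cs":
        return a == b
    if mode == "ci":
        return a.lower() == b.lower()
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("ci", "cs", "bin"):
        print(mode, equal_under("Hello", "HELLO", mode))
    # Precomposed "é" versus "e" followed by a combining acute accent.
    for mode in ("cs", "bin"):
        print(mode, equal_under("caf\u00e9", "cafe\u0301", mode))
```

The second loop shows why bin suits encoded data rather than text: two strings that look identical on screen can still differ byte for byte.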

Next, unicode or general refers to the specific sorting and comparison rules, in particular the way text is normalized or compared. There are many different sets of rules for the utf8mb4 character encoding, with unicode and general being two that attempt to work well in all possible languages rather than one specific language. The differences between these two sets of rules are the subject of this answer. Note that unicode uses rules from Unicode 4.0. Recent versions of MySQL add the rule sets unicode_520, using rules from Unicode 5.2, and 0900 (dropping the "unicode_" part), using rules from Unicode 9.0.

And lastly, utf8mb4 is of course the character encoding used internally. In this answer I'm talking only about Unicode-based encodings.
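The naming scheme described in this section can be summarized with a small Python helper that splits a collation name into its parts. It is only a sketch: describe_collation is a hypothetical name, the parsing is based on the naming pattern alone rather than on any MySQL API, and the ai/as suffixes (accent-insensitive and accent-sensitive) are included because they appear in the newer 0900 names even though they are not discussed above.

```python
# Trailing suffixes that describe sensitivity rather than the rule set.
SENSITIVITY_FLAGS = {
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "bin": "binary",
}


def describe_collation(name: str) -> dict:
    """Split a collation name into charset, rule set, and sensitivity flags."""
    parts = name.split("_")
    charset, rest = parts[0], parts[1:]
    flags = []
    # Peel recognized suffixes off the end; whatever remains is the rule set.
    while rest and rest[-1] in SENSITIVITY_FLAGS:
        flags.insert(0, rest.pop())
    return {
        "charset": charset,
        "ruleset": "_".join(rest) or None,
        "flags": [SENSITIVITY_FLAGS[f] for f in flags],
    }


if __name__ == "__main__":
    for name in ("utf8mb4_unicode_ci", "utf8mb4_unicode_520_ci",
                 "utf8mb4_0900_ai_ci", "utf8mb4_bin"):
        print(name, describe_collation(name))
```

Reading names this way makes it easier to see that utf8mb4_general_ci and utf8mb4_unicode_ci share a character set and case handling and differ only in the rule set.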
