UTF-8: General? Bin? Unicode?

2021-11-20 00:00:00 utf-8 mysql collation

I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should be using UTF-8 General CI (case-insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.

Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
What type of data would UTF-8 Binary be applicable to?

Recommended answer

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages "ß" is equal to "ss". utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
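To make the one-to-one versus expansion distinction concrete, here is a minimal Python sketch. It is a toy model, not MySQL's implementation: the mapping tables and the function names `general_key` and `unicode_key` are assumptions made up for illustration, and real collations use much larger weight tables.

```python
# Toy model only: these tables are illustrative assumptions,
# not the weight tables MySQL actually uses.

# One-to-one mapping: each character maps to exactly one character,
# in the spirit of utf8_general_ci.
ONE_TO_ONE = {"ß": "s", "é": "e", "ü": "u"}

# Expansion mapping: a single character may map to several characters,
# in the spirit of utf8_unicode_ci.
EXPANSIONS = {"ß": "ss", "é": "e", "ü": "u"}


def general_key(text: str) -> str:
    """Build a comparison key using only one-to-one character mappings."""
    return "".join(ONE_TO_ONE.get(ch, ch) for ch in text.lower())


def unicode_key(text: str) -> str:
    """Build a comparison key that allows one-to-many expansions."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    # With one-to-one mappings, "ß" can only match a single character.
    print(general_key("ß") == general_key("s"))
    print(general_key("Straße") == general_key("Strasse"))
    # With expansions, "ß" matches the two-character sequence "ss".
    print(unicode_key("Straße") == unicode_key("Strasse"))
```

In this model the one-to-one key treats "ß" like a single "s", so "Straße" and "Strasse" end up with keys of different lengths and do not match. The expansion key turns "ß" into "ss", so the two spellings compare as equal, which is the behavior the quoted passage attributes to utf8_unicode_ci.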

Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

For a more detailed explanation, please read the following post from the MySQL forums: http://forums.mysql.com/read.php?103,187048,188748

As for utf8_bin: both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparisons. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
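As a rough illustration of why a binary comparison is case-sensitive, the Python sketch below compares strings by their encoded UTF-8 bytes and contrasts that with a simple lowercased comparison. The helpers `bin_equal` and `ci_equal` are hypothetical names, and `str.lower()` is only a stand-in for a case-insensitive collation, not a reproduction of MySQL's rules.

```python
def bin_equal(a: str, b: str) -> bool:
    """Compare two strings by their raw UTF-8 bytes (binary-style comparison)."""
    return a.encode("utf-8") == b.encode("utf-8")


def ci_equal(a: str, b: str) -> bool:
    """Compare two strings after lowercasing (stand-in for a _ci collation)."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    words = ["b", "A", "a", "B"]

    # "M" and "m" are different bytes, so the binary comparison fails.
    print(bin_equal("MySQL", "mysql"), ci_equal("MySQL", "mysql"))

    # Byte-wise ordering puts every uppercase ASCII letter before lowercase ones.
    print(sorted(words, key=lambda s: s.encode("utf-8")))

    # Case-insensitive ordering groups "A" with "a" and "B" with "b".
    print(sorted(words, key=str.lower))
```

The same effect shows up in sorting: under the byte-wise key, uppercase ASCII letters sort ahead of all lowercase ones, so a binary-style collation suits data where exact byte identity matters more than human-friendly ordering.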
