MySQL 匹配 unicode 字符与 ascii 版本

2021-12-19 00:00:00 unicode select match mysql

我正在运行 MySQL 5.1.50 并且有一个如下所示的表:

I'm running MySQL 5.1.50 and have a table that looks like this:

organizations | CREATE TABLE `organizations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `url` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `phone` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=25837 DEFAULT CHARSET=utf8 |

我遇到的问题是 MySQL 将 unicode 字符与 ascii 版本匹配.例如,当我搜索包含 'é' 的单词时,它将匹配具有 'e' 的同一个单词,反之亦然:

The problem I'm having is that MySQL is matching unicode characters with ascii versions. For example when I search for a word with that contains an 'é', it will match the same word that has an 'e' instead, and vice versa:

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT id, name FROM `organizations` WHERE `name` = 'Universite de Montreal';
    +-------+-------------------------+
| id    | name                    |
+-------+-------------------------+
| 16973 | Université de Montreal  |
+-------+-------------------------+
1 row in set (0.01 sec)

我从 PHP 和命令行控制台得到这些结果.如何从我的 SELECT 查询中获得准确匹配?

I get these results both from PHP and the command line console. How can I get accurate matches from my SELECT queries?

谢谢!

推荐答案

您将 name 列指定为 text CHARACTER SET utf8 COLLATE utf8_unicode_ci 这告诉 MySQL 考虑 e 和 é 在匹配和排序中是等效的.该排序规则和 utf8_general_ci 都使很多东西等效.

You specified the name column as text CHARACTER SET utf8 COLLATE utf8_unicode_ci which tells MySQL to consider e and é as equivalent in matching and sorting. That collation and utf8_general_ci both make a lot of things equivalent.

http://www.collat​​ion-charts.org/ 是您学习后的绝佳资源如何阅读图表,这很容易.

http://www.collation-charts.org/ is a great resource once you learn how to read the charts, which is pretty easy.

如果您希望 e 和 é 等被视为不同,那么您必须选择不同的排序规则.要了解您的服务器上的排序规则(假设您仅限于 UTF-8 编码):

If you want e and é etc. to be considered different then you must choose a different collation. To find out what collations are on your server (assuming you're limited to UTF-8 encoding):

mysql> show collation like 'utf8%';

并选择使用整理图表作为参考.

And choose using the collation charts as a reference.

另一种特殊的排序规则是 utf8_bin,其中没有等效项,它是二进制匹配.

One more special collation is utf8_bin in which there are no equivalencies, it's a binary match.

我所知道的唯一非语言特定的 MySQL Unicode 排序规则是 utf8_unicode_ciutf8_general_ciutf8_bin.他们比较奇怪.归类的真正目的是使计算机按照某个地方的人所期望的方式进行匹配和排序.匈牙利语和土耳其语词典的条目按照不同的规则排序.指定排序规则允许您根据此类本地规则进行排序和匹配.

The only MySQL Unicode collations I'm aware of that are not language specific are utf8_unicode_ci, utf8_general_ci and utf8_bin. They are rather weird. The real purpose of a collation is to make the computer match and sort as a person from somewhere would expect. Hungarian and Turkish dictionaries have their entries ordered according to different rules. Specifying a collation allows you to sort and match according to such local rules.

例如,丹麦人似乎认为 e 和 é 是等价的,但冰岛人不认为:

For example, it seems Danes consider e and é equivalent but Icelanders don't:

mysql> select _utf8'e' collate utf8_danish_ci
    -> = _utf8'é' collate utf8_danish_ci as equal;
+-------+
| equal |
+-------+
|     1 |
+-------+

mysql> select _utf8'e' collate utf8_icelandic_ci
    -> = _utf8'é' collate utf8_icelandic_ci as equal;
+-------+
| equal |
+-------+
|     0 |
+-------+

另一个方便的技巧是用一堆您感兴趣的字符填充一个单列表(从脚本中更容易),然后 MySQL 可以告诉您等效项:

Another handy trick is to fill a one column table with a bunch of characters you're interested in (it's easier from a script) and then MySQL can tell you the equivalencies:

mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a               |
| á               |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á             |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á           |
+-----------------+

相关文章