How to detect UTF-8 characters in a Latin1-encoded column - MySQL
I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.
At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.
Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, then I can just fix this manually.
Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?
Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?
At the moment I have switched my MySQL client encoding to UTF-8.
Recommended answer
Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" characters as these are either LATIN1 accented characters or symbols, or the first of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.
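Outside MySQL (for example on values pulled from a dump), the same distinction can be sketched in Python. This is only an illustrative heuristic, and the function name `classify` is made up for this example: bytes with no high bit set are identical in both encodings, high bytes that decode as strict UTF-8 are very likely UTF-8, and anything else is treated as Latin1. A Latin1 string can occasionally happen to be valid UTF-8, so the result is a strong hint rather than proof.

```python
def classify(raw: bytes) -> str:
    """Guess how the raw bytes of one column value were encoded."""
    if all(b < 0x80 for b in raw):
        return "ascii"       # no high bytes: same in Latin1 and UTF-8
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "latin1"      # high bytes that are not valid UTF-8
    return "utf8"            # high bytes forming valid multi-byte sequences


samples = [b"Bjorn", "Björn".encode("utf-8"), "Björn".encode("latin-1")]
labels = [classify(s) for s in samples]
print(labels)
```

Rows classified as `utf8` are the ones worth inspecting by hand before choosing a conversion approach.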
To figure out which encoding is correct, you just SELECT two different versions and compare visually. Here's an example:
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')
This is made unusually complicated because the MySQL regexp engine seems to ignore things like \x80 and makes it necessary to use the UNHEX() method instead.
This produces results like this:
latin1       utf8
----------------------------------------
BjÃ¶rn       Björn
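To see why the two columns can differ for the same stored bytes, here is a small Python sketch of the two interpretations. It assumes the value was written by a UTF-8 client into a Latin1 column. Python's `latin-1` codec is ISO-8859-1, whereas MySQL's latin1 is closer to cp1252; the two agree for the bytes used here.

```python
# Bytes a UTF-8 client would have written into the column for "Björn"
stored = "Björn".encode("utf-8")

# Interpreting the same bytes both ways, like the two CONVERT(...) columns
as_latin1 = stored.decode("latin-1")  # each byte becomes one character
as_utf8 = stored.decode("utf-8")      # the two-byte sequence becomes "ö"

print(as_latin1, as_utf8)
```

If the utf8 reading looks right and the latin1 reading shows pairs like "Ã¶", the row holds UTF-8 bytes in a Latin1 column.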