How to fix "Incorrect string value" errors?

2021-11-20 00:00:00 mysql

After noticing that an application tended to discard random emails due to incorrect string value errors, I went through and switched many text columns to use the utf8 column charset and the default column collation (utf8_general_ci) so that it would accept them. This fixed most of the errors, and made the application stop getting SQL errors when it hit non-Latin emails, too.

Despite this, some of the emails are still causing the program to hit incorrect string value errors: (Incorrect string value: '\xE4\xC5\xCC\xC9\xD3\xD8...' for column 'contents' at row 1)

The contents column is a MEDIUMTEXT datatype which uses the utf8 column charset and the utf8_general_ci column collation. There are no flags that I can toggle in this column.

Keeping in mind that I don't want to touch or even look at the application source code unless absolutely necessary:

What is causing that error? (yes, I know the emails are full of random garbage, but I thought utf8 would be pretty permissive)
How can I fix it?
What are the likely effects of such a fix?

One thing I considered was switching to a utf8 varchar([some large number]) with the binary flag turned on, but I'm rather unfamiliar with MySQL, and have no idea if such a fix makes sense.

Solution

"\xE4\xC5\xCC\xC9\xD3\xD8" isn't valid UTF-8. Tested using Python:

>>> b"\xE4\xC5\xCC\xC9\xD3\xD8".decode("utf-8")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: invalid continuation byte
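
To get a hint of what the bytes actually are, a minimal Python 3 sketch can try a few legacy codecs on them. The candidate list and the `try_decode` helper are assumptions made for illustration; nothing in the question says which encoding the emails use.

```python
# Bytes quoted in the MySQL error message.
raw = b"\xE4\xC5\xCC\xC9\xD3\xD8"

# Candidate encodings are guesses for illustration only.
candidates = ["utf-8", "koi8_r", "cp1251", "latin-1"]


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {enc: try_decode(raw, enc) for enc in candidates}
for enc, text in results.items():
    print(enc, "->", repr(text))
```

On this input, utf-8 is rejected while koi8_r yields a readable Russian word. That suggests, but does not prove, that some emails arrive in a legacy Cyrillic encoding and the application hands the raw bytes to MySQL over a connection declared as UTF-8.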

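If you're looking for a way to avoid decoding errors within the database, a single-byte character set is the most permissive choice. In MySQL that is latin1, which is MySQL's variant of cp1252 (aka "Windows-1252" aka "Windows Western European"): it assigns a character to every byte value, including the five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that standard cp1252 leaves undefined, so no byte sequence is rejected.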

Of course it's not going to understand genuine UTF-8 any more, nor any other non-cp1252 encoding, but it sounds like you're not too concerned about that?
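
The permissiveness of single-byte encodings can be checked outside the database. The sketch below uses Python's own codecs, which are an assumption here and not MySQL's implementation: Python's "latin-1" is ISO-8859-1, and its "cp1252" follows the standard Windows table, so neither is identical to MySQL's latin1 character set.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 ("latin-1" in Python) maps every byte to a code point,
# so decoding never fails and the data round-trips unchanged.
text = all_bytes.decode("latin-1")
round_trip = text.encode("latin-1") == all_bytes

# Python's strict cp1252 codec leaves a few byte values undefined.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print(round_trip, [hex(v) for v in undefined])
```

A byte-preserving round trip is what makes such a column safe for arbitrary input: nothing is rejected and nothing is lost, although text that was really UTF-8 or KOI8-R will display as the wrong characters until it is decoded with the right codec.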
