Python: converting 4-byte characters to avoid the MySQL error "Incorrect string value:"

2021-12-28 00:00:00 python utf-8 character-encoding mysql python-unicode

I need to convert (in Python) a 4-byte character into some other character, so that I can insert it into my utf-8 MySQL database without getting an error such as: "Incorrect string value: '\xF0\x9F\x94\x8E' for column 'line' at row 1"

The question "Warning raised by inserting 4-byte unicode to mysql" suggests doing it this way:

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '

However, I get the same error as the user in the comment there: "...bad character range...". This is apparently because my Python is a UCS-2 (not UCS-4) build. What should I do instead?

Solution

In a UCS-2 build, Python internally uses 2 code units (a UTF-16 surrogate pair) for each Unicode character above the \U0000ffff code point. Regular expressions operate on those code units, so you need the following regular expression to match such characters:

highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

This regular expression matches any code point encoded as a UTF-16 surrogate pair (see UTF-16, Code points U+10000 to U+10FFFF).
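To see why those two character ranges cover every code point above U+FFFF, the UTF-16 surrogate arithmetic can be sketched as follows. The helper name `to_surrogate_pair` is made up for this illustration and is not part of the original answer or of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside U+10000..U+10FFFF")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


# The sleepy face from the example, U+1F62A
high, low = to_surrogate_pair(0x1F62A)
print(hex(high), hex(low))
```

The high surrogate always falls in D800-DBFF and the low surrogate in DC00-DFFF, which are exactly the two character classes in the pattern.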

To make this work on both UCS-2 and UCS-4 builds of Python, you could use try/except to pick one pattern or the other:

try:
    # UCS-4 build
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 build
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
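Since the question asks to convert such characters into some other character rather than only delete them, the same pattern could be wrapped in a small helper. This is a minimal sketch: the function name `replace_highpoints` and the default replacement U+FFFD (the Unicode replacement character, which takes 3 bytes in UTF-8) are assumptions for illustration, not part of the original answer.

```python
import re

try:
    # UCS-4 build: the full range compiles directly
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 build: match surrogate pairs instead
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def replace_highpoints(text, replacement=u'\ufffd'):
    """Replace every character above U+FFFF with `replacement`.

    Hypothetical helper; the default replacement is an assumption.
    """
    return highpoints.sub(replacement, text)


cleaned = replace_highpoints(u'Some example text with a sleepy face: \U0001f62a')
```

Passing `u''` as the replacement reproduces the stripping behaviour shown in the demonstration.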

Demonstration on a UCS-2 Python build:

>>> import re
>>> highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
