Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
I need to convert (in Python) a 4-byte character into some other character, so that I can insert the text into my utf-8 MySQL database without getting an error such as: "Incorrect string value: '\xF0\x9F\x94\x8E' for column 'line' at row 1"
The question "Warning raised by inserting 4-byte unicode to mysql" suggests doing it this way:
>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
However, I get the same error as the user in the comments there, "...bad character range...". This is apparently because my Python is a UCS-2 (not UCS-4) build. What should I do instead?
Solution
In a UCS-2 build, Python internally uses 2 code units for each Unicode character above the \U0000ffff code point. Regular expressions have to work with those code units, so you'd need to use the following regular expression to match such characters:
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
This regular expression matches any code point encoded with a UTF-16 surrogate pair (see UTF-16, code points U+10000 to U+10FFFF).
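As a rough illustration of why those two ranges cover exactly the code points above U+FFFF, the sketch below performs the standard UTF-16 surrogate arithmetic. The helper name `to_surrogate_pair` is made up for this example and is not part of the standard library.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in U+10000..U+10FFFF")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> DC00..DFFF
    return high, low


# The sleepy face from the question, U+1F62A
high, low = to_surrogate_pair(0x1F62A)
print(hex(high), hex(low))  # prints: 0xd83d 0xde2a
```

The first unit always lands in D800..DBFF and the second in DC00..DFFF, which is what the two character classes in the regular expression check for.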
To make this compatible across Python UCS-2 and UCS-4 builds, you could use a try/except to pick one or the other:
try:
    # UCS-4 build
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 build
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
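A possible alternative, not from the original answer, is to check the build explicitly instead of catching the exception. The sketch below assumes that `sys.maxunicode` is 0xFFFF on a UCS-2 build and 0x10FFFF otherwise, which also holds for Python 3.3 and later, where the UCS-2/UCS-4 distinction no longer exists.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # UCS-4 build (and any Python 3.3+): one code unit per code point
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
else:
    # UCS-2 build: code points above U+FFFF are stored as surrogate pairs
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

example = u'Some example text with a sleepy face: \U0001f62a'
print(repr(highpoints.sub(u'', example)))
```

Both approaches should end up with the same pattern; the explicit check merely makes the reason for the two branches visible in the code.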
Demonstration on a UCS-2 Python build:
>>> import re
>>> highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
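Since the question asks about converting the character to something else rather than dropping it, the pattern can also be used with a substitute. The following is only a sketch: the function name `replace_non_bmp` and the choice of U+FFFD (the Unicode replacement character, which takes three bytes in UTF-8) as the default are assumptions made for this example.

```python
import re

try:
    # UCS-4 build (and Python 3.3+)
    _HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 build
    _HIGHPOINTS = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def replace_non_bmp(text, replacement=u'\ufffd'):
    """Replace every character above U+FFFF with `replacement`.

    Hypothetical helper; pass replacement=u'' to strip instead.
    """
    return _HIGHPOINTS.sub(replacement, text)


cleaned = replace_non_bmp(u'Some example text with a sleepy face: \U0001f62a')
print(repr(cleaned))
```

Calling the function on each string before the INSERT keeps the rest of the text intact while marking where a 4-byte character used to be.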