How can I filter Emoji characters from my input so I can save in MySQL <5.5?

I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:

/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
  return self.cursor.execute(query, args)
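To make the byte counts concrete, here is a small sketch (my own illustration, not part of the original question). It assumes the bytes F0 9F 98 AD in the warning are the UTF-8 encoding of U+1F62D, and compares that character with U+25FD. MySQL's legacy utf8 character set stores at most three bytes per character, which is why the first one is rejected and the second is not.

```python
# -*- coding: utf-8 -*-
# Illustration only: compare UTF-8 byte lengths of an emoji and of U+25FD.
emoji = u'\U0001F62D'   # the character behind the bytes in the warning
square = u'\u25FD'      # WHITE MEDIUM SMALL SQUARE

emoji_bytes = emoji.encode('utf-8')
square_bytes = square.encode('utf-8')

# Four bytes for the emoji (too wide for MySQL's 3-byte utf8),
# three bytes for the square (fits).
print(len(emoji_bytes), len(square_bytes))
```

Note that U+25FD is still a multi-byte character in UTF-8; it simply fits within the three-byte limit.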

I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.

My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a non-multibyte character, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!

FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.
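For anyone wanting to check their own interpreter, a minimal sketch follows. The helper name is_wide_build is hypothetical, introduced only for this example.

```python
import sys

def is_wide_build():
    """Return True if this interpreter stores code points above U+FFFF directly.

    On a UCS-2 (narrow) Python 2 build sys.maxunicode is 65535;
    on a UCS-4 (wide) build, and on Python 3.3+, it is 1114111.
    """
    return sys.maxunicode > 0xFFFF

print(sys.maxunicode, is_wide_build())
```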

Thanks for reading.

Recommended answer

So it turns out this has been answered a few times; I just hadn't quite got the right Google-fu to find the existing questions.

  • Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
  • Warning raised by inserting 4-byte unicode to mysql

Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):

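import re

try:
    # UCS-4 (wide) build: characters above U+FFFF are single code points
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 (narrow) build: the same characters appear as surrogate pairs
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

# mytext = u'<some string containing 4-byte chars>'
# Replace every such character with U+25FD
mytext = highpoints.sub(u'\u25FD', mytext)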

The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
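The same substitution can be wrapped in a small helper. This is only a sketch, assuming Python 3.3 or later, where every build behaves like a UCS-4 build and the surrogate-pair fallback is not needed. The names replace_astral and _ASTRAL are mine, not from Martijn's answer.

```python
import re

# Matches any code point outside the Basic Multilingual Plane (above U+FFFF),
# i.e. any character that needs four bytes in UTF-8.
_ASTRAL = re.compile('[\U00010000-\U0010FFFF]')

def replace_astral(text, replacement='\u25FD'):
    """Return text with every 4-byte character swapped for replacement."""
    return _ASTRAL.sub(replacement, text)

cleaned = replace_astral('so sad \U0001F62D I can\u2019t even')
print(cleaned)
```

Characters inside the BMP, such as the curly apostrophe above, are left untouched, so only the ones MySQL 5.1 would reject get replaced.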

For those unfamiliar with UCS, like me: UCS-2 and UCS-4 describe how a given build of Python stores Unicode strings internally. A UCS-2 ("narrow") build uses 16-bit units, so sys.maxunicode is 65535 and any character above U+FFFF shows up as a pair of surrogates; a UCS-4 ("wide") build stores every code point directly, so sys.maxunicode is 1114111. That difference is why the code needs two patterns.
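To show what the UCS-2 pattern is matching, here is a sketch of the standard UTF-16 surrogate arithmetic. The function name to_surrogate_pair is hypothetical and exists only for this illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits, lands in D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits, lands in DC00-DFFF
    return high, low

high, low = to_surrogate_pair(0x1F62D)
print(hex(high), hex(low))
```

The high half always falls in the first character class of the UCS-2 regex and the low half in the second, so the two-class pattern matches exactly one such character.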

With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.

Hope this helps anyone else in the same situation!
