Python: Split unicode string on word boundaries

Problem description

I need to take a string and shorten it to 140 characters.

Currently I'm doing:

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)  # normalize whitespace
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

This works great for English and English-like strings, but fails for a Chinese string because tweet.split() returns a list with just one element:

>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']

How should I do this so that it handles I18N? Does this even make sense in all languages?

I'm on Python 2.5.4, if that matters.


Solution

After speaking with some native Cantonese, Mandarin, and Japanese speakers, it seems that doing the correct thing is hard, but my current algorithm still makes sense to them in the context of internet posts.

Meaning, they are used to the "split on space and add … at the end" treatment.

So I'm going to be lazy and stick with it until I get complaints from people who don't understand it.

The only change to my original implementation would be to not force a space after the last word, since it is unneeded in any language, and to use the Unicode character … (U+2026) instead of three dots (...) to save 2 characters.
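A minimal sketch of that change, assuming Python 3 (the question itself was on 2.5.4). The function name shorten_tweet and its parameters are hypothetical, and the URL is assumed to be shortened already, where the original code calls utils.shorten_urls. Only the separating spaces between kept words are counted, so no trailing space is charged for the last word:

```python
# -*- coding: utf-8 -*-
import re

ELLIPSIS = u"\u2026"  # single-character ellipsis, 2 characters shorter than "..."


def shorten_tweet(tweet, url, limit=140):
    """Cut tweet at a space so that tweet + ellipsis + url fits in limit.

    Hypothetical helper: `url` is assumed to be already shortened.
    """
    if len(tweet) <= limit:
        return tweet
    tweet = re.sub(r"\s+", " ", tweet).strip()  # normalize whitespace
    footer = ELLIPSIS + u" " + url
    avail = limit - len(footer)
    kept = []
    used = 0
    for word in tweet.split(" "):
        # A separating space is only needed before the second and later words,
        # so the last kept word never pays for a trailing space.
        cost = len(word) + (1 if kept else 0)
        if used + cost > avail:
            break
        kept.append(word)
        used += cost
    return u" ".join(kept) + footer
```

As in the original, text without spaces (such as the Chinese example) counts as one long word, so an over-long post of that kind is reduced to the footer alone. The sketch also does not guard against a footer that is itself longer than the limit.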
