Extracting English words from a string in Python
Problem description
I have a document in which each line is a string. A line may contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces). My code is below; it is the map function of my map-reduce job. However, judging by the final result, this mapper only produces frequency counts for single letters (such as a, b, c). Can anyone help me find the bug? Thanks.
import sys
import re
for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s %s' % (word, 1)
Solution
You've actually got two problems.
First, this:
line = re.sub("[^A-Za-z]", "", line.strip())
This removes all non-letters from the line, spaces included. That means you no longer have any whitespace to split on, and therefore no way to separate the line into words.
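A quick check makes the effect visible. This is only a sketch in Python 3 syntax; the sample string is made up for illustration and is not from the original data:

```python
import re

sample = "Hello world! 123 foo*bar"  # hypothetical input line

# Deleting every non-letter also deletes the spaces between words,
# so all the words run together into one token.
stripped = re.sub("[^A-Za-z]", "", sample.strip())
pieces = stripped.split()

print(stripped)  # Helloworldfoobar
print(pieces)    # ['Helloworldfoobar']
```

With no whitespace left, `split()` can only ever return a single item.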
Next, even if you hadn't done that, you do this:
words = ' '.join(line.split())
This doesn't give you a list of words; it gives you a single string, with all those words joined back together. (Basically, the original line with every run of whitespace converted into a single space.)
So, in the next line, when you do this:
for word in words:
You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
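The difference between looping over the joined string and looping over the split list can be seen in a small sketch (Python 3 syntax; the sample line is hypothetical):

```python
line = "the quick fox"  # hypothetical, already cleaned line

joined = ' '.join(line.split())  # a single str
as_list = line.split()           # a list of str

# Iterating a str yields one character at a time (spaces included);
# iterating the list yields whole words.
chars = [item for item in joined]
words = [item for item in as_list]

print(chars[:4])  # ['t', 'h', 'e', ' ']
print(words)      # ['the', 'quick', 'fox']
```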
If you want each word (as your variable names imply), you already had them; the problem is that you joined them back into a string. Just don't do that:
words = line.split()
for word in words:
Or, if you want to strip out everything besides letters and whitespace, use a regular expression that does exactly that, rather than one that strips out everything besides letters, whitespace included:
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:
line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
... or just:
words = re.split(r"[^A-Za-z]", line.strip())
for word in words:
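To compare the two variants side by side, here is a minimal sketch in Python 3 syntax. The sample string and the helper names `extract_words` and `map_lines` are assumptions made for illustration, and the `re.findall` version is a swapped-in alternative, not something from the answer above. One thing the comparison shows: splitting on single non-letter characters can leave empty strings in the result, which you would probably want to skip before counting.

```python
import re

sample = "abc1def  ghi!"  # hypothetical input line

# Variant 1: replace each non-letter with a space, then split on whitespace.
via_sub = re.sub(r"[^A-Za-z]", " ", sample.strip()).split()

# Variant 2: split directly on each non-letter character.
# Adjacent or trailing separators produce empty strings.
via_split = re.split(r"[^A-Za-z]", sample.strip())

# Alternative (not from the answer): collect runs of letters directly.
def extract_words(line):
    """Return the lowercase runs of ASCII letters found in line."""
    return re.findall(r"[A-Za-z]+", line.lower())

def map_lines(lines):
    """Yield (word, 1) pairs for every word in an iterable of lines."""
    for line in lines:
        for word in extract_words(line):
            yield word, 1

print(via_sub)                  # ['abc', 'def', 'ghi']
print(via_split)                # ['abc', 'def', '', 'ghi', '']
print(list(map_lines([sample])))
```

In a streaming mapper, `map_lines(sys.stdin)` could be looped over and each pair printed in whatever key/value format the job expects.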