Counting word frequency in a text string with Python

2022-03-11 00:00:00 string, word, frequency

"""
Author: 皮蛋编程 (https://www.pidancode.com)
Created: 2022/3/27
Description: count how often each word occurs in a text string with Python
"""

str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""
print("Original string:")
print(str1)

# create a list of words by splitting on whitespace
wordList1 = str1.split()
# strip any punctuation marks and build modified word list
# start with an empty list
wordList2 = []
for word1 in wordList1:
    # last character of each word
    lastchar = word1[-1:]
    # use a list of punctuation marks
    if lastchar in [",", ".", "!", "?", ";"]:
        word2 = word1.rstrip(lastchar)
    else:
        word2 = word1
    # build a wordList of lower case modified words
    wordList2.append(word2.lower())
print("Word list created from modified string:")
print(wordList2)

# create a word frequency dictionary
# start with an empty dictionary
freqD2 = {}
for word2 in wordList2:
    freqD2[word2] = freqD2.get(word2, 0) + 1
# create an alphabetically sorted list of the keys
# all words are lower case already
keyList = sorted(freqD2)
print("Frequency of each word in the word list (sorted):")
for key2 in keyList:
    print("%-10s %d" % (key2, freqD2[key2]))

Output:

Original string:
Man who run in front of car, get tired.
Man who run behind car, get exhausted.
Word list created from modified string:
['man', 'who', 'run', 'in', 'front', 'of', 'car', 'get', 'tired', 'man', 'who', 'run', 'behind', 'car', 'get', 'exhausted']
Frequency of each word in the word list (sorted):
behind     1
car        2
exhausted  1
front      1
get        2
in         1
man        2
of         1
run        2
tired      1
who        2

The code above was tested under Python 3.9.
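
The script above strips only a trailing punctuation mark from each word and counts with a plain dictionary. As a rough alternative sketch, the same count can be done with the standard library's `re.findall` and `collections.Counter`. The function name `word_frequencies` and the pattern `[a-z']+` are assumptions made for illustration, not part of the original script. The pattern treats any run of letters and apostrophes as a word, which may not suit every text (digits and hyphenated words, for example, are handled differently).

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lower-cased word to its count."""
    # [a-z']+ matches runs of letters and apostrophes, so punctuation on
    # either side of a word (not only at the end) is dropped.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    sample = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""
    freq = word_frequencies(sample)
    # print alphabetically, in the same layout as the original script
    for word in sorted(freq):
        print("%-10s %d" % (word, freq[word]))
```

For the sample string this should give the same counts as the dictionary version. If the most frequent words are wanted instead of an alphabetical listing, `freq.most_common(n)` returns the `n` highest counts as `(word, count)` pairs.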
