Creating a signature for a text

2022-05-30 00:00:00 python directory glob database txt

Problem description

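I am writing a program that needs to read all the lines and words in a txt file, count how many times each word occurs, and do this for multiple txt files. I then need to build a "signature" from the 25 most frequent words and compare each txt file's signature with those of the other files. I have worked all day to get the program to count the words in each text, but I am stuck on how to produce the signature. Basically, when the program runs, it displays the following: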

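The program creates a column named Word that shows every word in the texts along with the number of times it occurs in each text file. I have two files right now, but I will have more later. I need to sort this word list so that the 25 most frequent words become the signature and are stored in a list, one list per text. I do not know how to sort that many words. I have been thinking about how to do this and considered creating a list, but I do not think that will work. Could anyone give me some advice and show some code? I can also show you the program privately and show the code changes that way. Given how long I have spent on this today, any help would be great. Thanks in advance!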

Solution

You can try this:

import pandas as pd

df = pd.DataFrame([['word1',1,222], ['word2',10,20],['word3',111,1],['word4',11,62]], columns =['word', 'file1','file2'])

#Convert the columns containing word count to numeric
df['file1'] = pd.to_numeric(df['file1'])
df['file2'] = pd.to_numeric(df['file2'])

wordlist = []
for column in df.columns:
    if column != 'word':
        # Sort the rows by this file's counts, highest first, and take the top
        # words from the 'word' column. Replace 3 with the required number
        # (25 for the signature described in the question).
        # .tolist() stores plain words, giving one list of words per file.
        wordlist.append(df.sort_values(column, ascending=False)['word'].head(3).tolist())

print(wordlist)
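
If the counts are not already in a DataFrame, the same idea can be sketched with the standard library alone. The following is only a rough sketch under a few assumptions: the helper names `build_signature` and `signature_overlap` are hypothetical, words are taken to be runs of lowercase letters and apostrophes, and "comparing" two signatures is assumed to mean measuring how many words they share (Jaccard similarity), which the question does not specify.

```python
import re
from collections import Counter


def build_signature(text, size=25):
    """Return the `size` most frequent words in `text`, most frequent first."""
    # Assumption: a word is a run of letters/apostrophes after lowercasing.
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _ in Counter(words).most_common(size)]


def signature_overlap(sig_a, sig_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    set_a, set_b = set(sig_a), set(sig_b)
    if not set_a and not set_b:
        return 0.0  # two empty signatures: treat as no measurable overlap
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical sample texts standing in for the contents of the txt files.
texts = {
    "file1": "the cat sat on the mat the cat slept",
    "file2": "the dog sat on the log the dog barked",
}

# One signature (a list of words) per text; size=3 keeps the demo readable.
signatures = {name: build_signature(body, size=3) for name, body in texts.items()}
print(signatures)
print(signature_overlap(signatures["file1"], signatures["file2"]))
```

For real files, the sample strings could be replaced by the contents of each txt file, for example read with `pathlib.Path.read_text()`, and `size` left at its default of 25.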
