How to calculate the term frequencies for a set of documents?

2022-01-15 00:00:00 lucene java

I have a Lucene index with the following documents:

doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }

So these 5 documents use 14 different terms:

[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]

The frequency of each term:

[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]

For easier reading:

[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]

What I want to know now is: how do I obtain the term frequency vector for a set of documents?

For example:

Set<Documents> docs := [ doc2, doc3 ]

termFrequencies = magicFunction(docs); 

System.out.println( termFrequencies );

would result in the output:

[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]

With all zeros removed:

[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]

Notice that the result vector contains only the term frequencies of the set of documents, NOT the overall frequencies of the whole index! The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.

A naive implementation would be to just iterate over all documents in the docs set, create a map, and count each term. But I need a solution that also works with a document set of 100,000 or 500,000 documents.
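For reference, the naive approach described above might look like the following minimal sketch. It assumes each document is already available as a plain list of terms; the class name NaiveTermCount and the method termFrequencies are made up for illustration and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NaiveTermCount {

    // Counts how often each term occurs across the given documents.
    // Assumption: a document is modeled as a plain list of terms;
    // text stored in a real index would have to be tokenized first.
    static Map<String, Integer> termFrequencies(Collection<List<String>> docs) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by term for stable output
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        // prints {amarr=1, armor=2, dodixie=1, gallente=1, laser=1, planet=2}
        System.out.println(termFrequencies(Arrays.asList(doc2, doc3)));
    }
}
```

The cost grows with the total number of terms in the selected documents, which is why this becomes a concern at 100,000 documents or more.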

Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that one could build at index time to obtain such a term vector easily and quickly?

I'm not much of a Lucene expert, so I'm sorry if the solution is obvious or trivial.

Maybe worth mentioning: the solution should be fast enough for a web application, applied to client search queries.

Recommended answer

Go here: http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method:

org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno);

You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update (that has deletes :-)).

I believe there is a similar method in Lucene 2.x.x.
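To illustrate how per-document term vectors could be combined for a set of document ids, here is a rough sketch in plain Java. It does not call Lucene: the TermVector class and the vectorsByDocId map are hypothetical stand-ins for what the index would return. In Lucene 3.x, as far as I know, the two parallel arrays would come from TermFreqVector.getTerms() and TermFreqVector.getTermFrequencies(), and the field must have been indexed with term vectors enabled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSum {

    // Hypothetical stand-in for a stored per-document term vector:
    // parallel arrays of terms and their frequencies in that document.
    static final class TermVector {
        final String[] terms;
        final int[] freqs;

        TermVector(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    // Sums the stored vectors of the selected documents.
    // Only the chosen documents contribute, not the whole index.
    static Map<String, Integer> sum(Map<Integer, TermVector> vectorsByDocId, int... docIds) {
        Map<String, Integer> total = new TreeMap<>();
        for (int docId : docIds) {
            TermVector v = vectorsByDocId.get(docId);
            if (v == null) {
                continue; // no vector stored for this document
            }
            for (int i = 0; i < v.terms.length; i++) {
                total.merge(v.terms[i], v.freqs[i], Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed ids 2 and 3 for doc2 and doc3 from the question.
        Map<Integer, TermVector> index = new HashMap<>();
        index.put(2, new TermVector(
                new String[] {"armor", "dodixie", "gallente", "planet"}, new int[] {1, 1, 1, 1}));
        index.put(3, new TermVector(
                new String[] {"amarr", "armor", "laser", "planet"}, new int[] {1, 1, 1, 1}));
        // prints {amarr=1, armor=2, dodixie=1, gallente=1, laser=1, planet=2}
        System.out.println(sum(index, 2, 3));
    }
}
```

This still visits every selected document once, so it reduces the per-document cost (no re-tokenizing) rather than removing the linear dependence on the size of the set.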
