如何对用 lucene 索引的文档进行分类

我用 Lucene 对一组文档进行了分类(字段:内容、类别).每个文档都有自己的类别,但其中一些被标记为未分类.有没有什么方法可以在java中轻松分类这些文档?

I have classified a set of documents with Lucene (fields: content, category). Each document has it's own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in java?

推荐答案

从 Lucene 5.2.1 开始,您可以使用 索引文档以对新文档进行分类.开箱即用,Lucene 提供了一个朴素贝叶斯分类器,一个 k-最近邻分类器(基于 MoreLikeThis 类)和基于感知器的分类器.

As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-Nearest Neighbor classifier (based on the MoreLikeThis class) and a Perceptron based classifier.

缺点是所有这些类都标有实验性警告,并附有维基百科的链接.

The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.

相关文章