How to index and search text files in Lucene 3.0.2?

2022-01-15 00:00:00 indexing text-files lucene java

I am a newbie to Lucene, and I'm having some problems creating simple code to query a text file collection.

I tried this example, but it is incompatible with the new version of Lucene.

UPDATE: This is my new code, but it still doesn't work.

Accepted answer

Lucene is quite a big topic with a lot of classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a quickly available service, use Solr instead. If you need full control of Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)
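For completeness, here is a minimal sketch of reading a whole text file into a String using only the standard library (a plain BufferedReader loop, so it works on the JDKs of the Lucene 3.0 era; the class and method names are illustrative):

```java
import java.io.*;

public class FileReadingSketch {
    // Read an entire text file into a String, preserving line breaks.
    public static String readFile(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');  // keep line breaks for the analyzer
            }
        } finally {
            reader.close();  // always release the file handle
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file, then read it back.
        File tmp = File.createTempFile("lucene-demo", ".txt");
        Writer w = new OutputStreamWriter(new FileOutputStream(tmp), "UTF-8");
        w.write("hello lucene\nsecond line\n");
        w.close();
        System.out.print(readFile(tmp.getAbsolutePath()));
        tmp.delete();
    }
}
```

The resulting String is what you would pass to the `content` field shown further down.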

Whatever you are going to do in Lucene - indexing or searching - you need an analyzer. The goal of an analyzer is to tokenize (break into words) and stem (reduce to the base form of a word) your input text. It also throws out the most frequent words such as "a", "the", etc. You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create an instance of SnowballAnalyzer for English, do this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

If you are going to index texts in different languages and want to select the analyzer automatically, you can use Tika's LanguageIdentifier.
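A hedged sketch of that approach (it assumes Tika's `org.apache.tika.language.LanguageIdentifier` is on the classpath; the mapping from ISO language codes to SnowballAnalyzer names below is illustrative, not exhaustive):

```java
import org.apache.tika.language.LanguageIdentifier;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerPicker {
    // Pick a SnowballAnalyzer based on Tika's language guess for the text.
    public static Analyzer pick(String text) {
        String lang = new LanguageIdentifier(text).getLanguage();  // ISO 639 code, e.g. "en"
        if ("de".equals(lang)) return new SnowballAnalyzer(Version.LUCENE_30, "German");
        if ("fr".equals(lang)) return new SnowballAnalyzer(Version.LUCENE_30, "French");
        return new SnowballAnalyzer(Version.LUCENE_30, "English");  // fallback
    }
}
```

Remember that the analyzer chosen at indexing time must also be the one used at query time, as noted later in this answer.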

You need to store your index somewhere. There are two main options: an in-memory index, which is easy to try out, and a disk index, which is the most widely used.
Use either of the next two lines:

Directory directory = new RAMDirectory();   // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index"));  // disk index storage

When you want to add, update, or delete a document, you need an IndexWriter:

IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));

Any document (a text file in your case) is a set of fields. To create a document that will hold information about your file, use this:

Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));  // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc);  // writing new document to the index

The Field constructor takes the field's name, its text, and at least two more parameters. The first is a flag that shows whether Lucene must store this field. If it equals Field.Store.YES, you will be able to get all of your text back from the index; otherwise only index information about it will be stored.
The second parameter shows whether Lucene must index this field or not. Use Field.Index.ANALYZED for any field you are going to search on.
Normally, you use both parameters as shown above.
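As a hedged illustration of the other combinations (Lucene 3.0 API; the field names and variables here are hypothetical): an exact identifier you retrieve and filter on can skip analysis, while a large body you search but never display can skip storage:

```java
// Stored but not tokenized: good for identifiers you match exactly.
doc.add(new Field("path", "/path/to/file.txt", Field.Store.YES, Field.Index.NOT_ANALYZED));
// Indexed but not stored: searchable, yet not retrievable from the index.
doc.add(new Field("body", veryLargeText, Field.Store.NO, Field.Index.ANALYZED));
```

Skipping storage for large fields keeps the index small; skipping analysis keeps identifiers as single exact terms.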

Don't forget to close your IndexWriter after the job is done:

writer.close();

Searching is a bit tricky. You will need several classes: Query and QueryParser to build a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store the results (it is passed to IndexSearcher as a parameter), and ScoreDoc to iterate through the results. The next snippet shows how this all fits together:

IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;
// `i` is just a number of document in Lucene. Note, that this number may change after document deletion 
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = searcher.doc(hits[i].doc);  // getting actual document
    System.out.println("Title: " + hitDoc.get("title"));
    System.out.println("Content: " + hitDoc.get("content"));
    System.out.println();
}

Note the second argument to the QueryParser constructor - it is the default field, i.e. the field that will be searched if no qualifier is given. For example, if your query is "title:term", Lucene will search for the word "term" in the "title" field of all documents, but if your query is just "term", it will search in the default field, in this case "content". For more info, see the Lucene query syntax.
QueryParser also takes an analyzer as its last argument. This must be the same analyzer you used to index your text.

The last thing you must know is the first parameter of TopScoreDocCollector.create. It is just a number that says how many results you want to collect. For example, if it equals 100, Lucene will collect only the first (by score) 100 results and drop the rest. This is just an optimization - you collect the best results, and if you're not satisfied with them, you repeat the search with a larger number.
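That retry could be sketched like this (a sketch against the Lucene 3.0 API; `searcher` and `query` are the objects from the snippet above, and the starting size of 10 is an arbitrary assumption):

```java
int wanted = 10;  // assumed initial page size
TopScoreDocCollector collector = TopScoreDocCollector.create(wanted, true);
searcher.search(query, collector);
if (collector.getTotalHits() > wanted) {
    // More matches exist than we collected: search again with a bigger collector.
    collector = TopScoreDocCollector.create(collector.getTotalHits(), true);
    searcher.search(query, collector);
}
ScoreDoc[] hits = collector.topDocs().scoreDocs;
```

Collecting a small number first and growing only on demand keeps the common case cheap.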

Finally, don't forget to close the searcher and the directory so as not to lose system resources:

searcher.close();
directory.close();

Also see the IndexFiles demo class from the Lucene 3.0 sources.
