How do I get positions from a document term vector in Lucene?

2022-01-15 00:00:00 lucene java

I need to iterate over all documents in a Lucene index, and obtain the positions at which each term occurs in each document. As far as I am able to understand from the Lucene javadoc, the way to do this is to do something like this:

IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while( terms.next() != null ) {
    p = terms.postings( p, PostingsEnum.ALL );
    while( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
        int freq = p.freq();
        for( int i = 0; i < freq; i++ ) {
            int pos = p.nextPosition();   // Always returns -1!!!
            BytesRef data = p.getPayload();
            doStuff( freq, pos, data );   // Fails miserably, of course.
        }
    }
}

However, even though (1) the index does indeed include positions on the relevant field and (2) the term vector claims to have positions (i.e.: tv.hasPositions() == true), I keep getting "-1" for all positions.

First, am I doing something wrong? Is there an alternative way of iterating over postings on a per-document basis? Second: What is going on anyway? The index contains positions, the Terms instance returned by getTermVector claims to include positions, and I'm looking at the correct position values in Luke, yet I still get -1 when I try to access said values in my code. What gives?

The relevant field was configured with the following options:

FieldType ft = new FieldType();
ft.setIndexOptions( IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS );
ft.setStoreTermVectors( true );
ft.setStoreTermVectorOffsets( true );
ft.setStoreTermVectorPayloads( true );
ft.setStoreTermVectorPositions( true );
ft.setTokenized( true );
return ft;

Recommended answer

Did you set FieldType.setStoreTermVectorPositions(true) on your field type at index time? http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/document/FieldType.html#setStoreTermVectorPositions(boolean)
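For comparison, here is a minimal end-to-end sketch of writing a field with term vector positions and reading them back. It assumes the Lucene 5.5 API linked above (with lucene-core and lucene-analyzers-common on the classpath), uses an in-memory RAMDirectory, and the class name TermVectorPositions, the field name "body", and the sample text are all made up for illustration. It is not a diagnosis of the asker's index, only a small known-good baseline to test against.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

// Hypothetical demo class; assumes Lucene 5.5-era APIs.
public class TermVectorPositions {

    // Returns "term@position" entries read from one document's term vector.
    static List<String> positions(IndexReader reader, int docId, String field) throws Exception {
        List<String> out = new ArrayList<>();
        Terms tv = reader.getTermVector(docId, field);
        if (tv == null || !tv.hasPositions()) {
            return out; // no term vector, or it was stored without positions
        }
        TermsEnum terms = tv.iterator();
        PostingsEnum p = null;
        BytesRef term;
        while ((term = terms.next()) != null) {
            p = terms.postings(p, PostingsEnum.POSITIONS);
            p.nextDoc(); // a term vector covers a single document; advance onto it
            int freq = p.freq();
            for (int i = 0; i < freq; i++) {
                out.add(term.utf8ToString() + "@" + p.nextPosition());
            }
        }
        return out;
    }

    // Indexes one small document in memory and reads its positions back.
    static List<String> demo() throws Exception {
        FieldType ft = new FieldType();
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        ft.setTokenized(true);
        ft.setStoreTermVectors(true);
        ft.setStoreTermVectorPositions(true); // the flag the answer asks about
        ft.freeze();

        try (Directory dir = new RAMDirectory()) {
            try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new Field("body", "alpha beta alpha gamma", ft));
                w.addDocument(doc);
            } // closing the writer commits the document
            try (IndexReader r = DirectoryReader.open(dir)) {
                return positions(r, 0, "body");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

If this prints sensible positions against the same Lucene jars while the real index still yields -1, the difference most likely lies in how that index was built (for example, documents indexed before the field type was changed) rather than in the iteration loop.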
