Lucene - give more weight the closer a term is to the beginning of the title
I understand how to boost fields at index time or at query time. But how can I increase the score of a match when the term occurs closer to the beginning of a title?
Example:
Query = "lucene"
Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"
I would like the first document to score higher because "lucene" is closer to the beginning (ignoring term frequency for now).
I can see how to use a SpanQuery to specify the proximity between terms, but I'm not sure how to use information about a term's position in the field.
I am using Lucene 4.1 in Java.
Recommended answer
I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on positions, which Lucene indexes by default.
Let's test it on its own first: you only have to provide your SpanTermQuery and the maximum position at which the term may be found (one in my example).
SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);
Given your two documents, this query will find only the first one, with title "Lucene: Homepage", provided you analyzed it with the StandardAnalyzer.
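To see why only the first title matches, it may help to model the position check outside Lucene. The sketch below is plain Java, not Lucene code: the class SpanFirstSketch and its methods tokenize and matchesFirst are hypothetical names used for illustration, and tokenize is only a crude stand-in for StandardAnalyzer. It assumes the documented SpanFirstQuery rule that a span matches when its end position is less than or equal to the given end, where a single term at position p spans from p to p + 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SpanFirstSketch {

    // Crude analyzer stand-in (assumption): lowercase, then split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // True if the term occurs at some position p with p + 1 <= end,
    // i.e. the single-term span [p, p + 1) ends at or before `end`.
    static boolean matchesFirst(String title, String term, int end) {
        List<String> tokens = tokenize(title);
        for (int p = 0; p < tokens.size() && p + 1 <= end; p++) {
            if (tokens.get(p).equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "lucene" is at position 0, so its span ends at 1: within the limit.
        System.out.println(matchesFirst("Lucene: Homepage", "lucene", 1));                 // true
        // "lucene" is at position 5, so its span ends at 6: outside a limit of 1.
        System.out.println(matchesFirst("I have a question about lucene?", "lucene", 1));  // false
        // Raising the limit to 6 lets the second title match as well.
        System.out.println(matchesFirst("I have a question about lucene?", "lucene", 6));  // true
    }
}
```

In this model, raising the end argument widens the window at the start of the field in which the term is accepted.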
Now we can combine the SpanFirstQuery above with a normal text query, so that the span query only influences the score. This is easy to do with a BooleanQuery: make the text query a MUST clause and add the span query as a SHOULD clause, like this:
Term term = new Term("title", "lucene");
TermQuery termQuery = new TermQuery(term);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
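The division of labour between the two clauses can be pictured with a toy model. This is plain Java rather than Lucene, and the numbers are made-up constants, not Lucene's similarity formula: the class BoostSketch and the constants BASE_SCORE and FIRST_BONUS are hypothetical. The required clause decides whether a title is returned at all, while the optional clause can only add to the score of a title that already matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoostSketch {
    static final double BASE_SCORE = 1.0;   // made-up score for the required term clause
    static final double FIRST_BONUS = 0.5;  // made-up bonus for the optional "near the start" clause

    // Position of the first occurrence of `term`, or -1 if absent (crude tokenizer, assumption).
    static int firstPosition(String title, String term) {
        int position = 0;
        for (String piece : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (piece.isEmpty()) {
                continue;
            }
            if (piece.equals(term)) {
                return position;
            }
            position++;
        }
        return -1;
    }

    // Returns -1 when the required clause fails, otherwise the toy score.
    static double score(String title, String term, int end) {
        int position = firstPosition(title, term);
        if (position < 0) {
            return -1;                       // MUST clause not satisfied: not a hit
        }
        boolean nearStart = position + 1 <= end;
        return BASE_SCORE + (nearStart ? FIRST_BONUS : 0.0);  // SHOULD clause only adds
    }

    // Keeps only the hits and orders them by descending toy score.
    static List<String> rank(List<String> titles, String term, int end) {
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (score(title, term, end) >= 0) {
                hits.add(title);
            }
        }
        hits.sort((a, b) -> Double.compare(score(b, term, end), score(a, term, end)));
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>();
        titles.add("I have a question about lucene?");
        titles.add("Lucene: Homepage");
        titles.add("Solr tips");
        System.out.println(rank(titles, "lucene", 1));
    }
}
```

Both titles containing the term are returned, the one that starts with it ranks first, and the unrelated title is dropped. Real Lucene scores also depend on factors such as field length and query normalization, which this sketch ignores.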
There are probably other ways to achieve the same result, perhaps with a CustomScoreQuery or with custom scoring code, but this seems to me the easiest one.
The code I used to test it prints the following output (scores included). It indexes two titles slightly different from the ones in the question, then runs the TermQuery alone, the SpanFirstQuery alone, and finally the combined BooleanQuery:
------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242
The full code follows:
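import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The class name is illustrative; the original post showed only the methods.
public class SpanFirstQueryTest {
    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.open(new File("data"));
        index(directory);
        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Term term = new Term("title", "lucene");
        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);
        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);
        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
        search(indexSearcher, booleanQuery);
        indexReader.close();
    }

    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));
        // Recreate the index on every run so that repeated runs do not duplicate the documents.
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(directory, config);
        // Positions must be indexed for span queries to work.
        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);
        Document document = new Document();
        document.add(new Field("title", "I have a question about lucene", titleFieldType));
        writer.addDocument(document);
        document = new Document();
        document.add(new Field("title", "Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);
        writer.close();
    }

    // Runs the query and prints every stored field of each hit together with its score.
    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() + " - score: " + hit.score);
            }
        }
    }
}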