Are Filters faster than Queries in Lucene?

2022-01-15 00:00:00 lucene java

While reading "Lucene in Action", 2nd edition, I came across a description of the Filter classes, which can be used for result filtering in Lucene. Lucene has many filters that mirror Query classes, for example NumericRangeQuery and NumericRangeFilter.

The book says that NRF (NumericRangeFilter) does exactly the same as NRQ (NumericRangeQuery), but without document scoring. Does this mean that if I do not need scoring, or sort documents by a document field value, I should prefer filtering over querying from a performance point of view?

Recommended answer

I received a great answer from Uwe Schindler; let me repost it here.

If you don't cache filters, queries will be faster, as the ConjunctionScorer in Lucene has optimizations that are currently not used for Filters. Filters are fine if you cache them (e.g. if you always have the same access restrictions for a specific user that are applied to all his queries). In that case the Filter is executed only once, cached for all further requests, and then intersected with the query result set.
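
The "execute once, cache, then intersect" pattern described above can be illustrated with a minimal sketch. It uses only java.util.BitSet and is not Lucene's API: the class CachedFilterSketch, its methods, and the cache key are hypothetical names made up for illustration, and document IDs are simply modeled as bit positions.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model of a cached filter (hypothetical, not Lucene's API). Doc IDs are bit positions. */
final class CachedFilterSketch {
    private final Map<String, BitSet> cache = new HashMap<>();
    private int computations = 0; // how many times a filter was actually executed

    /** Returns the cached bitset for the key, computing it on first use only. */
    BitSet filterBits(String key, Supplier<BitSet> compute) {
        BitSet bits = cache.get(key);
        if (bits == null) {
            computations++;
            bits = compute.get();
            cache.put(key, bits);
        }
        return bits;
    }

    /** Intersects the query hits with the cached filter without mutating either input. */
    BitSet search(BitSet queryHits, String filterKey, Supplier<BitSet> compute) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits(filterKey, compute));
        return result;
    }

    int computations() {
        return computations;
    }
}
```

In this toy model, two searches that share the same filter key trigger only one filter computation; the second search pays only for the bitwise AND.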

If you only want to "filter" ad hoc, e.g. by a variable numeric range like a bounding box in a geographic search, use queries; queries are in most cases faster (e.g. range queries and similar ones, called MultiTermQueries, are internally also implemented by the same BitSet algorithm as the Filter; in fact they are only Filters wrapped by a Scorer implementation). But the Scorer that ANDs the query and your "filter" query together (ConjunctionScorer) is generally faster than the code that applies the filter after searching. Some improvement may be possible here, but in general Filters are something in Lucene that is not really needed anymore, so there have already been some approaches to make Filters and Queries the same, and then to also be able to cache non-scoring queries. This would make lots of code easier.
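
To give a rough feel for what "ANDing" two clauses means, here is a simplified sketch of a conjunction over two sorted doc ID lists, where each side skips ahead to the other side's current document. It is only an illustration of the general idea, not Lucene's ConjunctionScorer; the class and method names are hypothetical, and the linear scan in advance stands in for whatever skipping a real index would provide.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy conjunction over sorted doc ID arrays (hypothetical, not Lucene's ConjunctionScorer). */
final class ConjunctionSketch {
    /** Returns the smallest index i >= from with docs[i] >= target, or docs.length if none. */
    static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++; // linear scan for brevity; a real index could skip faster
        }
        return i;
    }

    /** Returns the doc IDs present in both sorted arrays. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]); // a lags behind: jump to b's current doc
            } else {
                j = advance(b, j, a[i]); // b lags behind: jump to a's current doc
            }
        }
        return out;
    }
}
```

For example, ConjunctionSketch.intersect(new int[] {1, 4, 7, 9, 12}, new int[] {2, 4, 9, 10, 12, 15}) yields [4, 9, 12]. Documents that cannot match are skipped on both sides, whereas filtering after the search would still collect every query hit first.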

Filters can bring a huge speed improvement with Lucene 4.0 if they are plugged on top of the IndexReader to filter the documents before scoring, but that is not yet implemented (see https://issues.apache.org/jira/browse/LUCENE-3212); I am working on it. We may also make Filters random access (it's easy, as they are bitsets), which could also improve the after-query filtering. But I would then also make Queries partially random access, if they could support it (like queries that are only based on FieldCache).
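
The idea of consulting a random-access bitset before scoring can be sketched as follows. This is an assumption-laden toy, not the design proposed in LUCENE-3212: PreScoringFilterSketch, scoreAccepted, and the scorer callback are hypothetical names, and candidate documents are passed in as a plain array.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntToDoubleFunction;

/** Toy model of filtering before scoring (hypothetical, not Lucene's API). */
final class PreScoringFilterSketch {
    /** Scores only candidates whose bit is set; returns the scores in candidate order. */
    static List<Double> scoreAccepted(int[] candidates, BitSet accept, IntToDoubleFunction scorer) {
        List<Double> scores = new ArrayList<>();
        for (int doc : candidates) {
            if (accept.get(doc)) { // constant-time random-access check
                scores.add(scorer.applyAsDouble(doc)); // scoring cost is paid only here
            }
        }
        return scores;
    }
}
```

In this sketch, rejected documents never reach the scorer callback, which is the saving the paragraph above hopes for.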

Uwe
