Is there a fast, accurate Highlighter for Lucene?

2022-01-15 00:00:00 lucene java

I've been using the (Java) Highlighter for Lucene (in the Sandbox package) for some time. However, it isn't very accurate at matching the correct terms in search results. It works well for simple queries: for example, searching for two separate words will highlight both code fragments in the results.

However, it doesn't cope well with more complicated queries. In the simplest case, a phrase query such as "Stack Overflow" will highlight every occurrence of Stack or Overflow, not just the phrase itself, which gives the user the impression that it isn't working very well.
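To make the difference concrete, here is a minimal plain-Java sketch of the two behaviours described above. It does not use Lucene and is not how the Sandbox Highlighter is implemented; the class `PhraseHighlightSketch`, its methods, the whitespace tokenization, and the `<b>` markup are all assumptions made purely for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: contrasts term-wise and phrase-aware highlighting.
public class PhraseHighlightSketch {

    // Term-wise: wraps every token that equals any query term (case-insensitive).
    static String highlightTerms(String text, String... terms) {
        Set<String> wanted = new HashSet<>();
        for (String t : terms) {
            wanted.add(t.toLowerCase(Locale.ROOT));
        }
        String[] tokens = text.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) out.append(' ');
            if (wanted.contains(tokens[i].toLowerCase(Locale.ROOT))) {
                out.append("<b>").append(tokens[i]).append("</b>");
            } else {
                out.append(tokens[i]);
            }
        }
        return out.toString();
    }

    // Phrase-aware: wraps only runs of consecutive tokens that match the phrase in order.
    static String highlightPhrase(String text, String... phrase) {
        if (phrase.length == 0) return text;
        String[] tokens = text.split(" ");
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokens.length) {
            if (i > 0) out.append(' ');
            if (matchesAt(tokens, i, phrase)) {
                out.append("<b>");
                for (int j = 0; j < phrase.length; j++) {
                    if (j > 0) out.append(' ');
                    out.append(tokens[i + j]);
                }
                out.append("</b>");
                i += phrase.length;
            } else {
                out.append(tokens[i]);
                i++;
            }
        }
        return out.toString();
    }

    // True if the phrase occurs in the token array starting at position start.
    private static boolean matchesAt(String[] tokens, int start, String[] phrase) {
        if (start + phrase.length > tokens.length) return false;
        for (int j = 0; j < phrase.length; j++) {
            if (!tokens[start + j].equalsIgnoreCase(phrase[j])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "Stack Overflow answers a stack of questions about overflow";
        System.out.println(highlightTerms(text, "stack", "overflow"));
        System.out.println(highlightPhrase(text, "stack", "overflow"));
    }
}
```

The term-wise version marks the stray "stack" and "overflow" as well as the phrase, which is the behaviour complained about here; the phrase-aware version marks only the adjacent pair.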

I tried applying the fix here, but that came with a lot of performance caveats and in the end was simply unusable. Performance is especially an issue with wildcard queries. This is due to the way the highlighting works: instead of working only on the query string and the text, it parses the query as Lucene would and then looks for all the matches that Lucene has made. Unfortunately, this means that for certain wildcard queries it can be looking for matches to 2000+ clauses on large documents, and it's simply not fast enough.

Is there any faster implementation of an accurate highlighter?

Recommended answer

There is a new, faster highlighter (it needs to be patched in, but it will be part of release 2.9):

https://issues.apache.org/jira/browse/LUCENE-1522

There is also a back-reference to this question.
