Using a CharFilter with Lucene 4.3.0's StandardAnalyzer

2022-01-15 00:00:00 lucene java

I am trying to add a CharFilter to my StandardAnalyzer. My intention is to strip out punctuation from all the text I index; for example, I want a PrefixQuery "pf" to match "P.F. Chang's", or "zaras" to match "Zara's".

It seems that the easiest plan of attack here is to filter out all punctuation before analysis. Per the Analyzer package documentation, that means I should use a CharFilter.

However, it seems next to impossible to actually insert a CharFilter into the analyzer!

The JavaDoc for Analyzer.initReader says "Override this if you want to insert a CharFilter".

If my code extends Analyzer, I can override initReader, but I cannot delegate the abstract createComponents to my base StandardAnalyzer, as it is protected. I cannot delegate tokenStream to my base analyzer either, because it is final. So a subclass of Analyzer seemingly cannot use another Analyzer to do its dirty work.

There is an AnalyzerWrapper class that seems perfect for what I want! I can provide a base analyzer and only override the pieces that I want. Except … initReader is overridden already to delegate to the base analyzer, and this override is "final"! Bummer!

I guess I could have my Analyzer be in the org.apache.lucene.analyzers package and then I can access the protected createComponents method, but this seems like a disgustingly hacky way to bypass the public API that I really should use.

Am I missing something glaring here? How can I amend a StandardAnalyzer to use a custom CharFilter?

Recommended answer

The intent is for you to extend Analyzer, rather than StandardAnalyzer. The thinking is that you should never subclass an Analyzer implementation (there is some discussion of that here). Analyzer implementations are pretty straightforward though, and adding a CharFilter to an Analyzer that implements the same tokenizer/filter chain as StandardAnalyzer would look something like:

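import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class MyAnalyzer extends Analyzer {
    // Analyzer itself has no matchVersion field, so declare one here.
    private final Version matchVersion = Version.LUCENE_43;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Return your CharFilter-wrapped reader here; the bare reader is only a placeholder.
        return reader;
    }
}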
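
As for what the wrapped reader might actually do, here is a minimal plain-Java sketch of the character-level stripping described in the question. It is not a Lucene CharFilter: the class name PunctuationStrippingReader and the strip helper are made up for illustration, and the rule "drop anything that is not a letter, digit, or whitespace" is my assumption about what counts as punctuation.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper (not a Lucene class): drops punctuation from the wrapped stream. */
class PunctuationStrippingReader extends FilterReader {

    PunctuationStrippingReader(Reader in) {
        super(in);
    }

    /** Assumption: keep letters, digits, and whitespace; drop everything else. */
    private static boolean isKept(int c) {
        return Character.isLetterOrDigit(c) || Character.isWhitespace(c);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip dropped characters until a kept character or end of stream (-1).
        while ((c = super.read()) != -1 && !isKept(c)) {
            // nothing to do, just keep reading
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        // Fill the buffer one filtered character at a time.
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    /** Demonstration only: runs a whole string through the filter. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new PunctuationStrippingReader(new StringReader(text))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With this, strip("P.F. Chang's") gives "PF Changs" and strip("Zara's") gives "Zaras", which is the text the tokenizer would need to see for the prefix "pf" to match after lowercasing. The sketch works one char at a time, so characters outside the Basic Multilingual Plane are dropped along with the punctuation. To plug the same logic into initReader you would probably want a real CharFilter subclass, since CharFilter also asks you to implement correct(int) so that token offsets map back to the original text. If I remember correctly, the analyzers-common module also ships a PatternReplaceCharFilter that takes a Pattern, a replacement string, and a Reader, which may save you from writing your own; check the JavaDoc for your version before relying on that.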
