How to tokenize only certain words in Lucene

2022-01-15 00:00:00 dictionary tokenize lucene java

I'm using Lucene for my project and I need a custom Analyzer.

The code is:

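import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyCommentAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split the input into tokens, then apply the standard normalization filter.
        Tokenizer source = new StandardTokenizer(Version.LUCENE_48, reader);
        TokenStream filter = new StandardFilter(Version.LUCENE_48, source);

        // Drop the default English stop words.
        filter = new StopFilter(Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET);

        return new TokenStreamComponents(source, filter);
    }
}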

I've built it, but now I'm stuck. I need the filter to let through only certain words. It is the opposite of stop-word filtering: instead of removing the terms that appear in a word list, keep only the terms that appear in the word list, like a prebuilt dictionary. So StopFilter doesn't do the job, and none of the filters Lucene provides seems to fit. I think I need to write my own filter, but I don't know how.

Any suggestions?

Recommended answer

You're right to look to StopFilter for a starting point, so read the source!

Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).

Cut all that, and StopFilter boils down to:

public final class StopFilter extends FilteringTokenFilter {

    private final CharArraySet stopWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
        super(matchVersion, in);
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}

FilteringTokenFilter is a pretty simple class to extend. The key is the accept method. It is called for the current term: if it returns true, the term is passed on to the output stream; if it returns false, the term is discarded.
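
To make the accept contract concrete, here is a minimal plain-Java sketch of the same pattern. It is not Lucene's FilteringTokenFilter: AcceptingFilter and MinLengthFilter are hypothetical names invented for illustration, terms are plain strings rather than attributes on a TokenStream, and the bookkeeping the real class does (such as position increments) is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the accept() contract; this is NOT Lucene's FilteringTokenFilter.
abstract class AcceptingFilter {

    // Decide whether the current term survives, in the spirit of FilteringTokenFilter.accept().
    protected abstract boolean accept(String term);

    // Walk the input in order and keep only the terms for which accept() returns true.
    public final List<String> filter(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (accept(term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}

// Hypothetical subclass: keeps terms that are at least minLength characters long.
class MinLengthFilter extends AcceptingFilter {

    private final int minLength;

    MinLengthFilter(int minLength) {
        this.minLength = minLength;
    }

    @Override
    protected boolean accept(String term) {
        return term.length() >= minLength;
    }
}
```

A subclass only supplies the yes/no decision; the base class owns the iteration. That split is why a new filter needs little more than a constructor and an accept method.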

So the only thing you really need to change in StopFilter is to delete a single character (the `!`), so that accept returns the opposite of what it currently does. It wouldn't hurt to change a few names here and there as well.

public final class KeepOnlyFilter extends FilteringTokenFilter {

    private final CharArraySet keepWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
        super(matchVersion, in);
        this.keepWords = keepWords;
    }

    @Override
    protected boolean accept() {
        return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}
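
As a rough illustration of what removing the `!` changes, the following plain-Java sketch applies the same word list in both directions. It does not use Lucene at all: WordListDemo, stop, and keepOnly are hypothetical names, and a java.util.Set of strings stands in for CharArraySet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; WordListDemo and its methods are hypothetical names.
class WordListDemo {

    // StopFilter-style: drop every term that appears in the word list.
    static List<String> stop(List<String> terms, Set<String> words) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (!words.contains(term)) {
                out.add(term);
            }
        }
        return out;
    }

    // KeepOnlyFilter-style: the same membership test without the '!'.
    static List<String> keepOnly(List<String> terms, Set<String> words) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (words.contains(term)) {
                out.add(term);
            }
        }
        return out;
    }
}
```

For the terms the, quick, brown, fox and the word list {quick, fox}, stop returns the, brown while keepOnly returns quick, fox. The two results are complementary, which is the behavior the question asks for.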
