Using a Lucene Analyzer without an index - is my approach reasonable?

2022-01-15  lucene  java

My objective is to leverage some of Lucene's many tokenizers and filters to transform input text, but without the creation of any indexes.

For example, given this (contrived) input string...

" 某人的 - [texté] 在这里,foo ."

...and a Lucene analyzer like this...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();
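
(As an aside: "icu", "lowercase" and "icuFolding" are SPI names; as far as I know the same builder can also be written with the factory classes themselves. A sketch, assuming lucene-analyzers-icu and lucene-analyzers-common are on the classpath:

// Same analyzer built with the class-based overloads instead of SPI names.
// ICUTokenizerFactory and ICUFoldingFilterFactory come from lucene-analyzers-icu;
// LowerCaseFilterFactory comes from lucene-analyzers-common.
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory.class)
        .addTokenFilter(org.apache.lucene.analysis.core.LowerCaseFilterFactory.class)
        .addTokenFilter(org.apache.lucene.analysis.icu.ICUFoldingFilterFactory.class)
        .build();
)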

I would like to get the following output:

someone's texte here foo

The below Java method does what I want.

But is there a better (i.e. more typical and/or more concise) way for me to do this?

I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.

Here is the code:

Lucene 8.3.0 imports:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

My method:

private String transform(String input) throws IOException {

    // Build the analysis chain: ICU tokenizer, then lowercasing and ICU folding.
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    // The field name is arbitrary here - nothing is ever indexed.
    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);

    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}
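
For reference, a minimal usage sketch (TextTransformer is a hypothetical wrapper class holding the transform method above), feeding in the contrived input string:

import java.io.IOException;

// Hypothetical wrapper class containing the transform(String) method shown above.
public class TextTransformer {

    public static void main(String[] args) throws IOException {
        String input = " Someone's - [texté] here, foo .";
        System.out.println(new TextTransformer().transform(input));
        // Expected output: someone's texte here foo
    }

    // ... transform(String input) exactly as defined above ...
}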

Recommended answer

I have been using this set-up for a few weeks without issue. I have not found a more concise approach. I think the code in the question is OK.
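
If a slightly tidier form is wanted, the explicit try/finally can be written as try-with-resources, since TokenStream implements Closeable. A sketch of the same method, with no functional change:

// Same logic as the question's transform(String), rewritten with try-with-resources.
// The analyzer chain and token loop are unchanged; end() is still called before the
// stream is auto-closed.
private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    StringBuilder sb = new StringBuilder();
    try (TokenStream ts = analyzer.tokenStream("myField", new StringReader(input))) {
        CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    }
    return sb.toString().trim();
}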
