How do I get a Token from a Lucene TokenStream?
I'm trying to use Apache Lucene for tokenizing, and I am baffled by the process for obtaining Tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address my question:
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29
Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?
Recommended answer
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
// Fetch the attribute instances once; incrementToken() updates them in place.
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
The new way
According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is preferable to getAttribute.
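TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
// addAttribute returns the existing instance, or registers one if it is missing.
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset(); // must be called before the first incrementToken()
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
tokenStream.end();   // signals end of stream
tokenStream.close(); // releases the underlying reader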
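To make the attribute pattern less mysterious, here is a minimal sketch that needs no Lucene dependency. Every class in it (ToyTermAttribute, ToyOffsetAttribute, ToyWhitespaceStream, ToyTokenStreamDemo) is a hypothetical stand-in invented for illustration, not a Lucene class. It only mimics the shape of the API: the stream owns one instance per attribute class, the consumer asks for that same instance once, and each incrementToken() call overwrites its contents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins (NOT Lucene classes) that imitate the attribute-based API.
final class ToyTermAttribute {
    String value = "";
}

final class ToyOffsetAttribute {
    int start;
    int end;
}

final class ToyWhitespaceStream {
    private final Map<Class<?>, Object> attributes = new HashMap<>();
    // The stream registers the attributes it is going to populate.
    private final ToyTermAttribute termAtt = addAttribute(ToyTermAttribute.class);
    private final ToyOffsetAttribute offsetAtt = addAttribute(ToyOffsetAttribute.class);
    private final String text;
    private int pos;

    ToyWhitespaceStream(String text) {
        this.text = text;
    }

    // Returns the registered instance, creating it on first request.
    <T> T addAttribute(Class<T> type) {
        Object existing = attributes.get(type);
        if (existing == null) {
            try {
                existing = type.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot create " + type.getName(), e);
            }
            attributes.put(type, existing);
        }
        return type.cast(existing);
    }

    // Returns the registered instance, or fails if there is none.
    <T> T getAttribute(Class<T> type) {
        Object existing = attributes.get(type);
        if (existing == null) {
            throw new IllegalArgumentException("No attribute " + type.getName());
        }
        return type.cast(existing);
    }

    void reset() {
        pos = 0;
    }

    // Advances to the next whitespace-separated token and overwrites the
    // shared attribute instances in place; returns false when input is exhausted.
    boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            return false;
        }
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        termAtt.value = text.substring(start, pos);
        offsetAtt.start = start;
        offsetAtt.end = pos;
        return true;
    }
}

final class ToyTokenStreamDemo {
    // Consumer side: same loop shape as the answer above.
    static List<String> tokenize(String text) {
        ToyWhitespaceStream stream = new ToyWhitespaceStream(text);
        ToyTermAttribute term = stream.addAttribute(ToyTermAttribute.class);
        ToyOffsetAttribute offset = stream.addAttribute(ToyOffsetAttribute.class);
        List<String> out = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            // Copy the values out now; the next call overwrites them.
            out.add(term.value + "@" + offset.start + "-" + offset.end);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("quick brown fox"));
    }
}
```

The toy also hints at why addAttribute is the safer call on the consumer side: asking for an attribute the stream never registered simply creates it, whereas getAttribute throws an IllegalArgumentException.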