Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

2022-01-15 00:00:00 python indexing lucene apache java

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).

What I would like to do is the following: when a document is added to the index, Lucene tokenizes it, removes stop words, and so on. This is usually done by the Analyzer, if I am not mistaken.

What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to decide whether to keep the term or discard it. If the term is present in my dictionary, I keep it; otherwise I discard it.

How should I proceed?

Here is my custom implementation of the Analyzer (in Python):

    class CustomAnalyzer(PythonAnalyzer):
        def createComponents(self, fieldName, reader):
            source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
            filter = StandardFilter(Version.LUCENE_4_10_1, source)
            filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
            filter = StopFilter(Version.LUCENE_4_10_1, filter,
                                StopAnalyzer.ENGLISH_STOP_WORDS_SET)

            ts = tokenStream.getTokenStream()
            token = ts.addAttribute(CharTermAttribute.class_)
            offset = ts.addAttribute(OffsetAttribute.class_)

            ts.reset()
            while ts.incrementToken():
                startOffset = offset.startOffset()
                endOffset = offset.endOffset()
                term = token.toString()
                # accept or reject term
            ts.end()
            ts.close()

            # How to store the terms in the index now ?
            return ????

Thanks in advance for your guidance!

EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It gives you a TokenStream with which you can iterate through the token list of the field you are indexing.

Now there is something about the Attributes that I do not understand. More precisely, I can read the tokens, but I have no idea how to proceed afterward.

EDIT 2: I found this post, where they mention the use of CharTermAttribute. However, in Python I cannot access or get a CharTermAttribute. Any thoughts?

EDIT 3: I can now access each term; see the updated code snippet. What is left to be done is actually storing the desired terms...

Recommended answer

The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.

By defining a filter that extends PythonFilteringTokenFilter, I can make use of the accept() function (the same one StopFilter uses, for instance). accept() is called once per token, and the token is kept only when it returns True.

Here is the corresponding code snippet:

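    from org.apache.lucene.analysis.tokenattributes import CharTermAttribute
    from org.apache.pylucene.analysis import PythonFilteringTokenFilter

    class MyFilter(PythonFilteringTokenFilter):
        def __init__(self, version, tokenStream):
            super(MyFilter, self).__init__(version, tokenStream)
            # Attribute that exposes the text of the current token
            self.termAtt = self.addAttribute(CharTermAttribute.class_)

        def accept(self):
            # Called once per token: return True to keep it, False to drop it
            term = self.termAtt.toString()
            accepted = False
            # Do whatever is needed with the term
            # accepted = ... (True/False)
            return accepted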

Then just append the filter to the other filters (as in the code snippet of the question):

    filter = MyFilter(Version.LUCENE_4_10_1, filter)
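
To make the accept/reject decision concrete, here is a minimal pure-Python sketch of the same pattern, runnable without PyLucene. It is only a stand-in for what a filtering token filter does, not Lucene's actual implementation: the names `filtering_token_filter`, `make_dictionary_accept`, and `DICTIONARY`, as well as the sample tokens, are hypothetical and chosen for illustration.

```python
from typing import Callable, Collection, Iterable, Iterator


def filtering_token_filter(tokens: Iterable[str],
                           accept: Callable[[str], bool]) -> Iterator[str]:
    """Yield only the tokens for which accept(token) returns True.

    This mimics the role of a filtering token filter: upstream tokens
    pass through one at a time, and rejected ones are silently dropped.
    """
    for token in tokens:
        if accept(token):
            yield token


def make_dictionary_accept(dictionary: Collection[str]) -> Callable[[str], bool]:
    """Build an accept() predicate that keeps only terms found in `dictionary`."""
    def accept(term: str) -> bool:
        return term in dictionary
    return accept


# Hypothetical dictionary and already-lowercased tokens, for illustration only.
DICTIONARY = frozenset({"lucene", "index", "token"})
tokens = ["lucene", "stores", "each", "token", "in", "the", "index"]

kept = list(filtering_token_filter(tokens, make_dictionary_accept(DICTIONARY)))
print(kept)
```

In the PyLucene filter, the body of `accept()` would presumably reduce to the same membership test, for example `return self.termAtt.toString() in dictionary`, with `dictionary` being whatever set of terms you load yourself. Using a set (or frozenset) keeps each lookup cheap, which matters because the check runs once per token.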
