Example of using WikipediaTokenizer in Lucene


I want to use WikipediaTokenizer in a Lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html - but I have never used Lucene. I just want to convert a Wikipedia string into a list of tokens. However, I see that there are only four methods available in this class: end, incrementToken, reset, and reset(reader). Can someone point me to an example of how to use it?

Thanks.

Recommended Answer


In Lucene 3.0, the next() method was removed. Instead, you should use incrementToken() to iterate through the tokens; it returns false when you reach the end of the input stream. To read each token, you use the methods of the AttributeSource class. Depending on the attributes you want to obtain (term, type, payload, etc.), you need to add the class type of the corresponding attribute to your tokenizer using the addAttribute() method.
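
For your original goal of turning a Wikipedia markup string into a plain list of tokens, a minimal standalone sketch against the Lucene 3.0.x contrib-wikipedia API could look like the following (the input string and the class name are made up for illustration):

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikipediaTokenizerDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical wiki markup, just for illustration.
    String wikiText = "'''Lucene''' is a [[search engine]] library. [[Category:Software]]";

    WikipediaTokenizer tokenizer = new WikipediaTokenizer(new StringReader(wikiText));
    // Register the attribute you want to read before iterating.
    TermAttribute termAtt = tokenizer.addAttribute(TermAttribute.class);

    List<String> tokens = new ArrayList<String>();
    while (tokenizer.incrementToken()) {
      // term() returns the text of the current token.
      tokens.add(termAtt.term());
    }
    tokenizer.end();
    tokenizer.close();

    System.out.println(tokens);
  }
}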


The following partial code sample is from the test class of WikipediaTokenizer, which you can find if you download the Lucene source code.

...
// 'test' (the wiki markup input) and 'tcm' (a Map from token text to the
// expected token type) are defined earlier in the test class.
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
// Register the attributes to read before iterating over the stream.
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

while (tf.incrementToken()) {
  String tokText = termAtt.term();
  //System.out.println("Text: " + tokText + " Type: " + typeAtt.type());
  String expectedType = (String) tcm.get(tokText);
  assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
  assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType));
  count++;
  if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)) {
    numItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)) {
    numBoldItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)) {
    numCategory++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)) {
    numCitation++;
  }
}
...
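
Note that besides the ITALICS, BOLD_ITALICS, CATEGORY and CITATION constants counted above, WikipediaTokenizer also defines type constants for other markup (for example INTERNAL_LINK, EXTERNAL_LINK, HEADING and BOLD), so the same pattern extends to whatever token types you need to distinguish.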
