Tokenizing and removing stop words with Lucene and Java

2022-01-15 00:00:00 nlp tokenize lucene java stop-words

I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:

public String removeStopWords(String string) throws IOException {
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}
}

My main looks like this:

String file = "..../datatest.txt";
TestFileReader fr = new TestFileReader();
fr.imports(file);
System.out.println(fr.content);
String text = fr.content;

Stopwords stopwords = new Stopwords();
stopwords.removeStopWords(text);
System.out.println(stopwords.removeStopWords(text));

This is giving me an error but I can't figure out why.

Recommended answer

I had the same problem. To remove stop words with Lucene, you can either use its default stop set, which is returned by EnglishAnalyzer.getDefaultStopSet(), or create your own custom stop-word list.

The code below shows the correct version of your removeStopWords():

public static String removeStopWords(String textFile) throws Exception {
    // Lucene's built-in English stop-word set
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    // addAttribute() registers the attribute if it is not already present
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    // reset() must be called before the first incrementToken()
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    // Finish and release the stream once all tokens are consumed
    tokenStream.end();
    tokenStream.close();
    return sb.toString();
}

To use a custom list of stop words use the following:

// CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // this is the Lucene set
final List<String> stop_Words = Arrays.asList("fox", "the");
// The third argument (true) makes lookups ignore case
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stop_Words, true);
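
If it helps to see what a case-insensitive custom stop list does to the text, here is a rough plain-Java approximation that does not use Lucene at all. It is only a sketch: the class name StopWordSketch is made up for illustration, and splitting on anything that is not a letter or digit is my assumption, not what StandardTokenizer actually does (Lucene follows Unicode text segmentation rules, so results can differ on things like apostrophes or e-mail addresses).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: approximates tokenizing a string
// and dropping stop words with case-insensitive matching.
final class StopWordSketch {

    // Same custom words as in the list above, stored in lower case.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("fox", "the"));

    static String removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        // Assumption: a token is a run of letters or digits.
        for (String token : text.trim().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can produce a leading empty string
            }
            // Compare in lower case, but keep the token's original case.
            if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // prints "quick brown jumps over lazy dog"
        System.out.println(removeStopWords("The quick brown Fox jumps over the lazy dog."));
    }
}
```

Both "The" and "Fox" are dropped even though the list holds them in lower case, which is the effect the ignore-case flag is meant to give you in the CharArraySet.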
