Using default and custom stop words with Apache's Lucene (weird output)
I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
...
private static final String CONTENTS = "contents";

final String text = "This is a short test! Bla!";
final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

try {
    Analyzer analyzer = new StandardAnalyzer(stopSet);
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }
    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    System.out.println("Exception:\n");
    e.printStackTrace();
}
This outputs the desired result:
[this] [is] [a] [bla]
Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to GitHub), AND the custom stop set above (the actual one I'm going to use is a lot longer), so I tried this:
Analyzer analyzer = new EnglishAnalyzer(stopSet);
The output is:
[thi] [is] [a] [bla]
Yes, the "s" in "this" is missing. What's causing this? It also didn't use the default stop set.
The following changes remove both the default and the custom stop words:
Analyzer analyzer = new EnglishAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
tokenStream = new StopFilter(tokenStream, stopSet);
Question: What is the "right" way to do this? Is using the tokenStream within itself (see code above) going to cause problems?
Bonus question: How do I output the remaining words with the right upper/lower case, i.e. as they appear in the original text?
Answer
I will tackle this in two parts:

- stop words
- preserving the original case
Handling the Combined Stop Words

To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:
import org.apache.lucene.analysis.en.EnglishAnalyzer;
...
final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

// Merge Lucene's bundled English stop words into your own set.
// Note the direction: ENGLISH_STOP_WORDS_SET is unmodifiable, so add it
// into your own (mutable) CharArraySet, not the other way around.
CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);
The above code simply takes the English stop words bundled with Lucene and merges them with your list.
This gives the following output:
[bla]
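For completeness, here is a minimal sketch of plugging the merged set back into the StandardAnalyzer setup from the question (the field name and input text are taken from there):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
...
// stopSet now contains both the custom and the default English stop words.
try (Analyzer analyzer = new StandardAnalyzer(stopSet)) {
    TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader("This is a short test! Bla!"));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term + "] "); // prints: [bla]
    }
    tokenStream.end();
    tokenStream.close();
}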
Handling Word Case
This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.
Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list in your code.
So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.
You will need to prepare this file manually yourself (i.e. ignore the notes in part 1 of this answer).
My test file is just this:
short
this
is
a
test
the
him
it
I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();
This analyzer does the following:

- It uses the "icu" tokenizer org.apache.lucene.analysis.icu.segmentation.ICUTokenizer, which takes care of tokenizing on Unicode whitespace and handling punctuation.
- It applies the stop word list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The wordset format means "one word per line" (there are other formats as well).
The key here is that there is nothing in the above chain which changes word case.
So, now, using this new analyzer, the output is as follows:
[Bla]
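One caveat (worth checking against your own setup): the ICU tokenizer ships in its own module, so the lucene-analyzers-icu artifact needs to be on the classpath in addition to lucene-core and lucene-analyzers-common. With Maven, that would be something like:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-icu</artifactId>
    <version>8.6.3</version>
</dependency>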
Final Notes
Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.
But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).
I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:
<build>
    <resources>
        <resource>
            <directory>src/main/java</directory>
            <excludes>
                <exclude>**/*.java</exclude>
            </excludes>
        </resource>
    </resources>
</build>
This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.
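As an aside: if you instead keep the file under Maven's standard src/main/resources directory, it is copied onto the classpath automatically, with no POM customization needed at all.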
Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.
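Update - a likely explanation, offered with the caveat that I have not stepped through the 8.6.3 source: the EnglishAnalyzer ends its analysis chain with a PorterStemFilter, and the Porter stemmer reduces "this" to "thi" (very short tokens such as "is" and "a" are left untouched). A small sketch to reproduce this, reusing the token-printing loop from the question:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
...
// Empty stop set, so nothing is removed and only the analyzer's own
// filters (including the stemmer) act on the tokens.
try (Analyzer analyzer = new EnglishAnalyzer(CharArraySet.EMPTY_SET)) {
    TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader("This is a short test! Bla!"));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        // expected, based on the behaviour observed in the question:
        // [thi] [is] [a] [short] [test] [bla]
        System.out.print("[" + term + "] ");
    }
    tokenStream.end();
    tokenStream.close();
}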
Follow-up Questions
"After merging the stop sets, I have to use the StandardAnalyzer, right?"

Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.
"I want to keep the stop-word file at a specific path outside the classpath - how do I do that?"

You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):
import java.nio.file.Path;
import java.nio.file.Paths;
...
Path resources = Paths.get("/path/to/resources/directory");

Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();
Instead of using .builder(), we now use .builder(resources).
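Putting the pieces together, a minimal end-to-end sketch (the directory path is illustrative; the stopwords.txt file is the one shown earlier):

import java.io.StringReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
...
Path resources = Paths.get("/path/to/resources/directory");
try (Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build()) {
    TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader("This is a short test! Bla!"));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term + "] "); // prints: [Bla] - original case preserved
    }
    tokenStream.end();
    tokenStream.close();
}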