Lucene index problem with the "-" character

2022-01-15 00:00:00 indexing escaping character lucene java

I'm having trouble with a Lucene index whose indexed words contain "-" characters.

It works for some words that contain "-" but not for all of them, and I can't find the reason why.

The field I'm searching in is analyzed and contains versions of the word both with and without the "-" character.

I'm using the analyzer org.apache.lucene.analysis.standard.StandardAnalyzer.

Here is an example:

If I search for "gsx-*" I get a result; the indexed field contains "SUZUKI GSX-R 1000 GSX-R1000 GSXR".

But if I search for "v-*" I get no result. The indexed field of the expected result contains "SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM".

If I search for "v-strom" without the "*" it works, but if I search for just "v-str", for example, I don't get the result. (There should be a result, because this is a live search for a webshop.)

So what is the difference between the two expected results? Why does it work for "gsx-" but not for "v-"?

Recommended answer

I believe StandardAnalyzer treats the hyphen as whitespace. So it turns your query "gsx-*" into "gsx*" and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.

So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of putting together your own analyzer, or of starting from one of those two and composing it with further TokenFilters. A very good explanation is given in the Lucene analysis package Javadoc.
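
To make the point concrete, here is a toy sketch in plain Java. It is not Lucene code and does not reproduce Lucene's real tokenizers. The class and method names (HyphenTokenDemo, splitOnPunctuation, splitOnWhitespace, anyTermStartsWith) are made up for illustration. It only shows why a prefix such as "v-" can match nothing when the hyphen is treated as a separator, and why it can match once "v-strom" is kept as a single term. The lowercasing is an assumption too: as far as I know, WhitespaceAnalyzer does not lowercase on its own, so in Lucene you would add a lowercase filter yourself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: this is NOT Lucene's tokenizer, just an illustration of the idea.
class HyphenTokenDemo {

    // Hypothetical "hyphen-splitting" analysis: lowercase, then break on anything
    // that is not a letter or digit, so "V-STROM" becomes "v" and "strom".
    static List<String> splitOnPunctuation(String text) {
        return tokenize(text, "[^\\p{L}\\p{N}]+");
    }

    // Hypothetical "whitespace-only" analysis: lowercase, then break on whitespace,
    // so "V-STROM" survives as the single term "v-strom".
    static List<String> splitOnWhitespace(String text) {
        return tokenize(text, "\\s+");
    }

    private static List<String> tokenize(String text, String separatorRegex) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split(separatorRegex)) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // A prefix query such as "v-*" can only match if some indexed term starts with "v-".
    static boolean anyTermStartsWith(List<String> terms, String prefix) {
        String lowered = prefix.toLowerCase(Locale.ROOT);
        return terms.stream().anyMatch(term -> term.startsWith(lowered));
    }

    public static void main(String[] args) {
        String field = "SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM";
        System.out.println(splitOnPunctuation(field));
        System.out.println(anyTermStartsWith(splitOnPunctuation(field), "v-"));
        System.out.println(splitOnWhitespace(field));
        System.out.println(anyTermStartsWith(splitOnWhitespace(field), "v-str"));
    }
}
```

To see what a real analyzer produces for your field, it is worth printing the tokens from its TokenStream rather than trusting a model like this one.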

BTW, there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string, both in the index and while parsing the query.
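
A minimal sketch of that idea, again in plain Java rather than Lucene. The normalization rule used here (lowercase, then drop everything that is not a letter or digit) is just one assumption about what "normalize" could mean for this shop data, and SharedNormalizerDemo with its methods is a hypothetical name. The point is only that the same function runs when a term is indexed and when the user's input is turned into a prefix.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: one normalizer shared by the indexing side and the query side.
class SharedNormalizerDemo {

    private final Set<String> indexedTerms = new HashSet<>();

    // Assumed normalization rule: lowercase and strip hyphens, spaces and other
    // punctuation, so "V-Strom", "v strom" and "VSTROM" all become "vstrom".
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Index time: store only the normalized form, entered once.
    void index(String term) {
        indexedTerms.add(normalize(term));
    }

    // Query time: normalize the user's input the same way, then prefix-match.
    boolean prefixSearch(String userInput) {
        String prefix = normalize(userInput);
        if (prefix.isEmpty()) {
            return false; // nothing left to search for
        }
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SharedNormalizerDemo demo = new SharedNormalizerDemo();
        demo.index("V-STROM");
        System.out.println(demo.prefixSearch("v-str"));
        System.out.println(demo.prefixSearch("V Strom"));
        System.out.println(demo.prefixSearch("gsx-"));
    }
}
```

With this arrangement only one variant needs to be indexed. Whether stripping the hyphen is the right rule depends on the data, so treat it as a starting point.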
