How do I run a Lucene query containing special characters with QueryParser?

2022-01-15 00:00:00 lucene java

Here is the situation. I have a term stored in the index that contains a special character, such as '-'. The simplest code looks like this:

Document doc = new Document();
// Lucene 3.6 API: Field (not TextField) takes a Field.Index argument
doc.add(new Field("message", "1111-2222-3333", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

I then create a query using QueryParser, like this:

String queryStr = "1111-2222-3333";
QueryParser parser = new QueryParser(Version.LUCENE_36, "message", new StandardAnalyzer(Version.LUCENE_36));
Query q = parser.parse(queryStr);

When I run this query through a searcher, I get no results. I have also tried this:

Query q = parser.parse(QueryParser.escape(queryStr));

Still no results.

Bypassing QueryParser and using a TermQuery directly does what I want, but that approach is not flexible enough for user-entered text.

I suspect the StandardAnalyzer does something that drops the special characters from the query string. Stepping through in a debugger, I found that the string is split and the actual query is "message:1111 message:2222 message:3333". I don't know what exactly Lucene is doing here...

So if I want to run a query that contains special characters, what should I do? Should I write my own analyzer, or subclass the default QueryParser? And how?

Update:

1 @The New Idiot @femtoRgon, I've tried QueryParser.escape(queryStr), as stated in the question, but it still doesn't work.

2 I've tried another way to solve the problem. I derived a QueryTokenizer from Tokenizer that splits words only on spaces, wrapped it in a QueryAnalyzer (which derives from Analyzer), and passed that QueryAnalyzer to the QueryParser.

Now it works. Originally it didn't work because the default StandardAnalyzer splits queryStr according to its default rules, which treat some special characters as separators, so by the time the query is built the special characters have already been removed by the StandardAnalyzer. My own tokenizer recognizes only spaces as separators, so the special characters remain in the query terms and the query matches the indexed term.
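The mismatch described in the update can be illustrated without Lucene at all. The following is only a simplified model, not Lucene's actual tokenizer: the class name `TokenizerMismatch` and both helper methods are hypothetical, and "split on anything that is not a letter or digit" is an assumption standing in for StandardAnalyzer's real rules. It shows why a term indexed verbatim can only be found when the query side keeps it in one piece.

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerMismatch {

    // Hypothetical stand-in for an analyzer that treats punctuation as a separator.
    static List<String> splitOnPunctuation(String text) {
        return Arrays.asList(text.trim().split("[^\\p{L}\\p{N}]+"));
    }

    // Hypothetical stand-in for a tokenizer that splits only on whitespace.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The field was indexed without analysis, so the index holds one term.
        String indexedTerm = "1111-2222-3333";
        String queryStr = "1111-2222-3333";

        List<String> punctuationTokens = splitOnPunctuation(queryStr);
        List<String> whitespaceTokens = splitOnWhitespace(queryStr);

        // None of the fragments equals the indexed term, so nothing matches.
        System.out.println(punctuationTokens + " matches: "
                + punctuationTokens.contains(indexedTerm));

        // The single unbroken token equals the indexed term.
        System.out.println(whitespaceTokens + " matches: "
                + whitespaceTokens.contains(indexedTerm));
    }
}
```

Lucene also ships a WhitespaceAnalyzer that splits only on whitespace, so passing it to the QueryParser may give the same effect as the custom QueryTokenizer/QueryAnalyzer pair; whether it fits depends on how the other fields in the index were analyzed.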

3 @The New Idiot @femtoRgon, thank you for answering my question.

Recommended answer

I am not sure about this, but I guess you need to escape - with \. As per the Lucene docs:

The "-" or prohibit operator excludes documents that contain the term after the "-" symbol.

Again,

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

To escape these characters, use a \ before the character.

Also remember that some characters need to be escaped twice if they also have special meaning in Java; for example, a backslash inside a Java string literal is written as \\.
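A rough sketch of what this escaping amounts to, using only the standard library. The class `EscapeSketch` and the method `escapeQuerySyntax` are hypothetical names, and the character set is an assumption based on the list quoted above; this is a simplified model and not Lucene's own QueryParser.escape implementation, which is what real code should call.

```java
public class EscapeSketch {

    // Assumed set of single characters with meaning in the query syntax
    // (taken from the list above; "&&" and "||" are covered via '&' and '|').
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Hypothetical helper: prefixes every special character with a backslash.
    static String escapeQuerySyntax(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped query text contains one backslash before each '-'.
        System.out.println(escapeQuerySyntax("1111-2222-3333")); // prints 1111\-2222\-3333

        // Written as a Java literal, each of those backslashes must be doubled.
        String literal = "1111\\-2222\\-3333";
        System.out.println(literal.equals(escapeQuerySyntax("1111-2222-3333"))); // prints true
    }
}
```

Escaping only stops the parser from reading - as the prohibit operator. The analyzer passed to the QueryParser still tokenizes the term afterwards, which is consistent with the question's report that QueryParser.escape alone did not return any results.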
