Pattern matching in Elasticsearch?

2022-01-15 00:00:00 elasticsearch lucene java

Continuing from my earlier post: I have changed the query because, according to femtoRgon's post, some characters and anchors are not supported by Elasticsearch.

I am looking for a way to match a pattern like "xxx-xx-xxxx" in order to find documents containing social security numbers using Elasticsearch.

Let's suppose that, among the indexed documents, I would like to find all those documents that have social security numbers matching the "xxx-xx-xxxx" pattern.

Sample code for indexing documents:

InputStream is = null;
try {
    is = new FileInputStream("/home/admin/Downloads/20121221.doc");
    ContentHandler contenthandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    parser.parse(is, contenthandler, metadata, new ParseContext());
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
}

Sample code for searching:

QueryBuilder queryBuilderFullText = QueryBuilders.filteredQuery(
        QueryBuilders.matchAllQuery(),
        FilterBuilders.regexpFilter("_all", "[0-9]{3}?[0-9]{2}?[0-9]{4}"));
SearchRequestBuilder requestBuilder = client.prepareSearch()
        .setIndices(getDomainIndexId(project))
        .setTypes(getProjectTypeId(project))
        .setQuery(queryBuilderFullText);
SearchResponse response = requestBuilder.execute().actionGet(ES_TIMEOUT_MS);
SearchHits hits = response.getHits();
if (hits.getTotalHits() > 0) {
    System.out.println(hits.getTotalHits());
} else {
    return 0L;
}

I am getting the following hits:

45-555-5462
457-55-5462
4578-55-5462
457-55-54623
457-55-5462-23

But as per my requirement, it should only return "457-55-5462" (based on matching the pattern "xxx-xx-xxxx").

Please help.

Recommended answer

Seeing as ^, $ and \d can't be used, I would do this:

[^0-9-][0-9]{3}-[0-9]{2}-[0-9]{4}[^0-9-]

Or in Java:

FilterBuilders.regexpFilter("_all", "[^0-9-][0-9]{3}-[0-9]{2}-[0-9]{4}[^0-9-]")

This checks that the number found is neither preceded nor followed by another digit or a dash. It does require that there be some character before and after the match, though, so it won't capture documents that have the social security number at the very beginning or very end.
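To see the effect of the two bracketing character classes, here is a minimal sketch that runs the same pattern through java.util.regex against the five hits from the question. This is only an illustration of the character-class logic: java.util.regex is not the regexp engine Elasticsearch uses, and the sketch does not reproduce how a regexp filter is applied to indexed terms. The class name SsnPatternDemo, the helper findSsns, and the capture group around the digits are all hypothetical additions, and each sample is wrapped in spaces to stand in for surrounding document text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SsnPatternDemo {
    // The answer's pattern; the parentheses (an addition for this sketch)
    // capture just the number, without the bracketing characters.
    static final Pattern SSN =
            Pattern.compile("[^0-9-]([0-9]{3}-[0-9]{2}-[0-9]{4})[^0-9-]");

    // Returns every xxx-xx-xxxx run that has a non-digit, non-dash
    // character on both sides.
    static List<String> findSsns(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "45-555-5462", "457-55-5462", "4578-55-5462",
            "457-55-54623", "457-55-5462-23"
        };
        for (String c : candidates) {
            // Surround each candidate with spaces, as if it sat mid-document.
            System.out.println(c + " -> " + findSsns(" " + c + " "));
        }
        // No surrounding characters: the limitation described above.
        System.out.println("bare -> " + findSsns("457-55-5462"));
    }
}
```

Under these assumptions only "457-55-5462" produces a non-empty list, and the final line prints an empty list because the bare string has no character before or after the number.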

Regex101 demo
