Highlighting performance in Lucene is very slow
The Lucene (4.6) highlighter is very slow when a frequent term is searched. The search itself is fast (100 ms), but highlighting can take more than an hour(!).
Details: a large text corpus was used (1.5 GB of plain text). Performance does not depend on whether the text is split into smaller pieces (tested with 500 MB and 5 MB pieces as well). Positions and offsets are stored. If a very frequent term or pattern is searched, TopDocs are retrieved fast (100 ms), but each searcher.doc(id) call is expensive (5-50 s), and getBestFragments() is extremely expensive (more than an hour), even though positions and offsets are stored and indexed for exactly this purpose. (Hardware: Core i7, 8 GB RAM.)
Greater background: this would serve language-analysis research. A special stemmer is used that also stores part-of-speech information. For example, if "adj adj adj adj noun" is searched, it returns all of its occurrences in the text with context.
Can I tune this for better performance, or should I choose another tool?
Code used:
// indexing
FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
offsetsType.setStored(true);
offsetsType.setIndexed(true);
offsetsType.setStoreTermVectors(true);
offsetsType.setStoreTermVectorOffsets(true);
offsetsType.setStoreTermVectorPositions(true);
offsetsType.setStoreTermVectorPayloads(true);
doc.add(new Field("content", fileContent, offsetsType));
// querying
TopDocs results = searcher.search(query, limitStart + limit);
int endPos = Math.min(results.scoreDocs.length, limitStart + limit);
int startPos = Math.min(results.scoreDocs.length, limitStart);
for (int i = startPos; i < endPos; i++) {
    int id = results.scoreDocs[i].doc;
    // bottleneck #1 (5-50s):
    Document doc = searcher.doc(id);
    FastVectorHighlighter h = new FastVectorHighlighter();
    // bottleneck #2 (more than 1 hour):
    String[] hs = h.getBestFragments(h.getFieldQuery(query),
            searcher.getIndexReader(), id, "content", contextSize, 10000);
}
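As an aside, the startPos/endPos window clamping above can be factored into a small self-contained helper; this is a sketch of the same Math.min logic (the class and method names here are mine, not from the original code):

```java
// Clamp a paging window [limitStart, limitStart + limit) to the number of
// hits actually returned, mirroring the startPos/endPos computation above.
class PageWindow {
    static int[] clamp(int hits, int limitStart, int limit) {
        int start = Math.min(hits, limitStart);
        int end = Math.min(hits, limitStart + limit);
        return new int[]{start, end};
    }

    public static void main(String[] args) {
        int[] w = PageWindow.clamp(7, 5, 10);   // only 7 hits, window 5..15
        System.out.println(w[0] + "," + w[1]);  // prints "5,7"
    }
}
```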
Related (unanswered) question: https://stackoverflow.com/questions/19416804/very-slow-solr-performance-when-highlighting
Answer:
getBestFragments relies on the tokenization done by the analyzer you are using. If you have to analyze such a big text, you had better store the term vector WITH_POSITIONS_OFFSETS at indexing time.
Please read this and this book.
By doing that, you won't need to re-analyze all the text at runtime, because the highlighter can pick a method that reuses the existing term vectors, which reduces highlighting time.
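The indexing code in the question already stores term vectors, so another hedged option worth trying on Lucene 4.6 is PostingsHighlighter: it reads the offsets stored in the postings (the DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS index option the question already sets) and never re-analyzes the stored text. This is a sketch, assuming `searcher`, `query`, and a `topDocs` result as in the question's code:

```java
// Sketch (Lucene 4.1+): PostingsHighlighter uses offsets from the postings
// lists rather than term vectors or re-analysis, which is typically much
// cheaper than FastVectorHighlighter on very large fields.
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

PostingsHighlighter highlighter = new PostingsHighlighter();
// one best snippet per hit in topDocs; an overload accepts maxPassages
String[] snippets = highlighter.highlight("content", query, searcher, topDocs);
```

Note that PostingsHighlighter requires the field to have been indexed with offsets in the postings, which the question's FieldType setup already does.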