FieldCache with a frequently updated index

2022-01-15 00:00:00 lucene c# .net java lucene.net

Hi
I have a Lucene index that is frequently updated with new records. It holds 5,000,000 records, and I cache one of my numeric fields using FieldCache. After updating the index, reloading the FieldCache takes time (I reload the cache because the documentation says DocIDs are not reliable). How can I minimize this overhead by adding only the newly added DocIDs to the FieldCache? This has become a bottleneck in my application.


IndexReader reader = IndexReader.Open(diskDir);
int[] dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // First call: takes about 4 seconds to load the array
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // Second call: returns immediately from the cache, as expected

// Here we add some documents to the index, then reopen the reader to reflect the changes
reader = reader.Reopen();
dateArr = FieldCache_Fields.DEFAULT.GetInts(reader, "newsdate"); // Takes about 4 seconds again to load the array

I want a mechanism that minimizes this time by adding only the newly added documents to our array. There is a technique like this one, http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html, that improves performance, but it still loads all the documents we already have. I think there is no need to reload them all if we can find a way to add only the newly added documents to the array.

Recommended answer

The FieldCache uses weak references to index readers as the keys for its cache (it gets them by calling IndexReader.GetCacheKey, which has been un-obsoleted). A standard call to IndexReader.Open with an FSDirectory uses a pool of readers, one for every segment.

You should always pass the innermost reader to the FieldCache. Check out ReaderUtil for helpers that retrieve the individual reader a document is contained within. Document ids won't change within a segment; what the documentation means by calling them unpredictable/volatile is that they can change between two index commits: deleted documents may have been pruned, segments may have been merged, and so on. Because unchanged segments keep their pooled readers across a reopen, their cached arrays are reused and only the readers for new segments need to be loaded.

A cached entry goes away only after a commit removes its segment from disk (because it was merged or optimized away). New readers then no longer include that pooled segment reader, and garbage collection removes the entry as soon as all older readers are closed.
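
To make the per-segment caching idea concrete, here is a minimal, self-contained model of it. It is only a sketch and does not use Lucene.Net: SegmentStub, PerSegmentCacheDemo and the sample values are hypothetical names invented for illustration, and ConditionalWeakTable merely stands in for a weakly keyed cache. It is not a description of how FieldCache is implemented internally.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a per-segment reader; NOT a Lucene.Net type.
public sealed class SegmentStub
{
    public int[] StoredValues { get; }
    public SegmentStub(params int[] storedValues) { StoredValues = storedValues; }
}

public static class PerSegmentCacheDemo
{
    // Weakly keyed cache: an entry can be collected once nothing else
    // references its segment object any more.
    private static readonly ConditionalWeakTable<SegmentStub, int[]> Cache =
        new ConditionalWeakTable<SegmentStub, int[]>();

    // Counts how often the (simulated) expensive load actually runs.
    public static int LoadCount { get; private set; }

    public static int[] GetInts(SegmentStub segment)
    {
        // The callback runs only when the segment has no cached array yet.
        return Cache.GetValue(segment, s =>
        {
            LoadCount++;
            return (int[])s.StoredValues.Clone();
        });
    }

    // Simulates: open a reader over two segments, add documents, reopen.
    public static int Run()
    {
        LoadCount = 0;

        var a = new SegmentStub(20110401, 20110402);
        var b = new SegmentStub(20110403);
        var beforeReopen = new List<SegmentStub> { a, b };
        foreach (var segment in beforeReopen) GetInts(segment); // loads a and b

        // New documents end up in a new segment; a and b are shared unchanged.
        var c = new SegmentStub(20110404);
        var afterReopen = new List<SegmentStub> { a, b, c };
        foreach (var segment in afterReopen) GetInts(segment);  // loads only c

        return LoadCount;
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints 3: two initial loads plus one for the new segment
    }
}
```

In this model the reopen costs one load instead of three, which is the effect you get from passing segment readers rather than the top-level reader as the cache key.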

Never, ever, call FieldCache.PurgeAllCaches(). It's meant for testing, not production use.

Added 2011-04-03; example code using subreaders.

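using System.Collections.Generic;
using System.IO;
using System.Linq;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var directory = FSDirectory.Open(new DirectoryInfo("index"));
var reader = IndexReader.Open(directory, readOnly: true);
var documentId = 1337;

// Grab all subreaders (one per segment).
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, reader);

// Walk the subreaders in order. While the remaining id is at or beyond the
// subreader's MaxDoc(), subtract MaxDoc() and go to the next one. Valid ids
// in a subreader run from 0 to MaxDoc() - 1, hence <= rather than <.
var subReaderId = documentId;
var subReader = subReaders.First(sub => {
    if (sub.MaxDoc() <= subReaderId) {
        subReaderId -= sub.MaxDoc();
        return false;
    }

    return true;
});

// Keyed on the segment reader, so the array survives a reopen of the top-level reader.
var values = FieldCache_Fields.DEFAULT.GetInts(subReader, "newsdate");
var value = values[subReaderId];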
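
The example above walks the subreaders linearly for every lookup. If you resolve many document ids, one option is to precompute the starting id of each subreader once and binary-search it. The following is a self-contained sketch of that idea in plain C#; DocIdMapper, BuildDocStarts and SubIndex are hypothetical helpers written for illustration (not Lucene.Net calls), and the segment sizes are made up. It assumes the document id is non-negative and smaller than the total document count.

```csharp
using System;

public static class DocIdMapper
{
    // docStarts[i] is the top-level id of the first document in subreader i.
    public static int[] BuildDocStarts(int[] maxDocs)
    {
        var starts = new int[maxDocs.Length];
        var next = 0;
        for (var i = 0; i < maxDocs.Length; i++)
        {
            starts[i] = next;
            next += maxDocs[i];
        }
        return starts;
    }

    // Returns the index of the subreader that contains docId.
    public static int SubIndex(int docId, int[] docStarts)
    {
        if (docId < 0) throw new ArgumentOutOfRangeException(nameof(docId));

        var pos = Array.BinarySearch(docStarts, docId);
        if (pos >= 0)
        {
            // Exact hit on a start. Empty subreaders share the same start,
            // so move to the last one with this start value.
            while (pos + 1 < docStarts.Length && docStarts[pos + 1] == docId) pos++;
            return pos;
        }

        // ~pos is the index of the first start greater than docId;
        // the subreader just before it contains the document.
        return ~pos - 1;
    }

    public static void Main()
    {
        var maxDocs = new[] { 1000, 300, 50 };     // made-up segment sizes
        var starts = BuildDocStarts(maxDocs);      // { 0, 1000, 1300 }

        var sub = SubIndex(1337, starts);
        var localId = 1337 - starts[sub];
        Console.WriteLine(sub + " / " + localId);  // prints "2 / 37"
    }
}
```

With the subreader index and local id in hand, the value lookup is the same as before: fetch the array for that subreader from the FieldCache and index it with the local id.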
