Getting term frequencies in Lucene
Is there a fast and easy way of getting term frequencies from a Lucene index without going through the TermVectorFrequencies class, since that takes an awful lot of time for large collections?
What I mean is, is there something like TermEnum that has not just the document frequency but the term frequency as well?
UPDATE: Using TermDocs is way too slow.
Recommended answer
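Use TermDocs to get the term frequency for a given document. As with the document frequency, you obtain the TermDocs from an IndexReader using the term of interest.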
You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.
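To illustrate why a single forward pass over a term's postings is cheap, here is a minimal sketch of decoding a frequency list laid out the way the Lucene 3.x file-format documentation describes the TermFreqs entries of the ".frq" file: each entry starts with a DocDelta value whose upper bits hold the gap from the previous document number and whose lowest bit, when set, means the frequency is 1; otherwise the frequency follows as a separate value. The class FrqPostingsSketch is hypothetical and not part of Lucene. It assumes the VInts have already been decoded into plain ints and ignores skip data. Its next(), doc() and freq() methods are only modelled on the iteration style of TermDocs.

```java
/** Hypothetical, simplified reader for one term's TermFreqs list (not a Lucene class). */
final class FrqPostingsSketch {
    private final int[] values; // assumed: VInts already decoded into ints
    private int pos = 0;        // read position in values
    private int doc = 0;        // current document number
    private int freq = 0;       // frequency of the term in the current document

    FrqPostingsSketch(int[] values) {
        this.values = values;
    }

    /** Moves to the next posting; returns false once the list is exhausted. */
    boolean next() {
        if (pos >= values.length) {
            return false;
        }
        int docDelta = values[pos++];
        doc += docDelta >>> 1;      // upper bits: gap from the previous document
        if ((docDelta & 1) != 0) {
            freq = 1;               // odd value: frequency is implicitly 1
        } else {
            freq = values[pos++];   // even value: frequency is stored next
        }
        return true;
    }

    int doc() {
        return doc;
    }

    int freq() {
        return freq;
    }

    /** Sums the per-document frequencies in one in-order pass. */
    static long totalTermFreq(int[] values) {
        FrqPostingsSketch postings = new FrqPostingsSketch(values);
        long total = 0;
        while (postings.next()) {
            total += postings.freq();
        }
        return total;
    }

    public static void main(String[] args) {
        // A term occurring once in document 7 and three times in document 11.
        int[] encoded = {15, 8, 3};
        FrqPostingsSketch postings = new FrqPostingsSketch(encoded);
        while (postings.next()) {
            System.out.println("doc=" + postings.doc() + " freq=" + postings.freq());
        }
        System.out.println("total=" + totalTermFreq(encoded));
    }
}
```

Because document numbers are stored as gaps, each posting can only be reconstructed from the one before it, which is why in-order iteration is fast and moving backwards is not.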
If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order: skipping forward is fine, but you can't jump back and forth in the document list efficiently.
Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally, I would look for a better algorithm to achieve my objective, or provide better hardware: lots of memory, either to hold a RAMDirectory or to give to the OS for use by its own file cache.