Efficient substring search in a large text file containing 100 million strings (no duplicate strings)

2022-01-15 00:00:00 file search lucene mysql java

I have a large text file (1.5 GB) containing 100 million strings (no duplicates), one string per line. I want to build a web application in Java so that when a user gives a keyword (a substring), they get the count of all strings in the file that contain that keyword. I already know one technique, Lucene. Is there any other way to do this? I want the result within 3-4 seconds. My system has 4 GB of RAM and a dual-core CPU. This needs to be done in Java only.

Recommended answer

Since you have more RAM than the size of the file, you might be able to hold the entire data set in RAM as a structure and search it very quickly. A trie might be a good data structure to use; it has fast prefix lookup, but I am not sure how it performs for substrings.
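To make the prefix-versus-substring point concrete, here is a minimal sketch of one way a trie could be adapted to count strings containing a keyword: insert every suffix of each string, so that any substring becomes a path from the root. The class and method names (`SubstringCountTrie`, `add`, `countContaining`) are hypothetical and not from the original answer. Note the assumption that memory is not a constraint: a structure like this stores a node for every distinct substring, so for 100 million strings it would very likely need far more than 4 GB, and the sketch is meant only to illustrate the idea on small inputs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a trie over all suffixes of every inserted string.
class SubstringCountTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count;       // number of distinct strings that contain the path to this node
        int lastId = -1; // id of the most recent string that passed through this node
    }

    private final Node root = new Node();
    private int nextId = 0; // also equals the number of strings added so far

    // Inserts every suffix of s, so each substring of s is a path from the root.
    public void add(String s) {
        int id = nextId++;
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int i = start; i < s.length(); i++) {
                node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
                // Count each string at most once per node, even if the
                // substring occurs several times inside the same string.
                if (node.lastId != id) {
                    node.lastId = id;
                    node.count++;
                }
            }
        }
    }

    // Returns how many added strings contain keyword as a substring.
    public int countContaining(String keyword) {
        if (keyword.isEmpty()) {
            return nextId; // every string contains the empty keyword
        }
        Node node = root;
        for (int i = 0; i < keyword.length(); i++) {
            node = node.children.get(keyword.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

A lookup costs time proportional to the keyword length, independent of how many strings were added. The price is insertion time and memory that grow with the square of each string's length, which is why a plain prefix trie is cheap while a substring-capable one is not.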
