Efficient substring search in a large text file containing 100 million strings (no duplicate strings)

2022-01-15 00:00:00 file search lucene mysql java

I have a large text file (1.5 GB) containing 100 million strings (no duplicates), one string per line. I want to build a web application in Java so that when a user enters a keyword (a substring), they get the count of all strings in the file that contain that keyword. I already know one technique, Lucene. Is there any other way to do this? I want the result within 3-4 seconds. My system has 4 GB of RAM and a dual-core CPU, and this needs to be done in Java only.


Since you have more RAM than the size of the file, you might be able to hold the entire data set in RAM as a structure and search it very quickly. A trie might be a good data structure to use: it gives fast prefix lookups, but I'm not sure how well it performs for substring searches.
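To make the prefix-versus-substring point concrete, here is a minimal sketch of one way a trie could be stretched to answer substring counts: insert every suffix of every string, so that any substring becomes a prefix of some suffix. This is an illustration, not necessarily what the answer had in mind. The class name `SubstringCountTrie` and its methods are hypothetical. Each node remembers the id of the last string that touched it, so a string containing the keyword several times is counted once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a trie over all suffixes of all strings.
public class SubstringCountTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count;        // number of distinct strings that pass through this node
        int lastId = -1;  // id of the most recent string that touched this node
    }

    private final Node root = new Node();
    private int nextId = 0; // also equals the number of strings added so far

    // Inserts every suffix of s; each visited node is credited once per string.
    public void add(String s) {
        int id = nextId++;
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int i = start; i < s.length(); i++) {
                node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
                if (node.lastId != id) {
                    node.lastId = id;
                    node.count++;
                }
            }
        }
    }

    // Returns how many added strings contain the keyword as a substring.
    public int countContaining(String keyword) {
        if (keyword.isEmpty()) {
            return nextId; // every string contains the empty keyword
        }
        Node node = root;
        for (int i = 0; i < keyword.length(); i++) {
            node = node.children.get(keyword.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        SubstringCountTrie trie = new SubstringCountTrie();
        trie.add("banana");
        trie.add("bandana");
        trie.add("cherry");
        System.out.println(trie.countContaining("an"));  // 2
        System.out.println(trie.countContaining("nan")); // 1
        System.out.println(trie.countContaining("xyz")); // 0
    }
}
```

A lookup costs time proportional to the keyword length, but the number of nodes can grow roughly with the square of each string's length, and every node here carries a `HashMap`. For 100 million strings that may well exceed 4 GB of RAM, so treat this as a way to see the idea on a small sample rather than as a sizing recommendation.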
