Instant search over petabytes of data

2022-01-15 00:00:00 hadoop lucene solr java

I need to search over a petabyte of data stored in CSV files. After indexing with Lucene, the index is twice the size of the original files. Is it possible to reduce the index size? How can Lucene index files be distributed in Hadoop, and how are they then used in a search environment? Or should I use Solr to distribute the Lucene index? My requirement is instant search over petabytes of files.

Recommended answer

Any decent off-the-shelf search engine (like Lucene) should be able to provide search over the amount of data you have. You may have to do a bit of work up front to design the indexes and configure how the search works, but this is just configuration.

You won't get truly instant results, but you may well get very quick ones. The speed will depend mostly on how you set it up and what kind of hardware you run on.

You mention that the indexes are larger than the original data. This is to be expected. Indexing usually includes some form of denormalisation. The size of the indexes is often a trade-off against speed: the more ways you slice and dice the data in advance, the quicker it is to find references, and the more space the index takes.
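To make the size trade-off concrete, here is a minimal sketch of a toy inverted index in plain Python. It is not Lucene's on-disk format or API; the `build_index` and `posting_count` helpers and the sample rows are made up for illustration. It only shows that every extra field you make searchable adds more entries to the index.

```python
from collections import defaultdict


def build_index(rows, fields):
    """Map (field, token) -> sorted list of row numbers containing that token."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for field in fields:
            for token in row[field].lower().split():
                index[(field, token)].add(row_id)
    return {key: sorted(ids) for key, ids in index.items()}


def posting_count(index):
    """Total number of (term, row) references held by the index."""
    return sum(len(ids) for ids in index.values())


# Hypothetical sample data standing in for parsed CSV rows.
rows = [
    {"name": "alice smith", "city": "paris", "note": "prefers email contact"},
    {"name": "bob jones", "city": "london", "note": "prefers phone contact"},
    {"name": "carol smith", "city": "paris", "note": "no contact"},
]

small = build_index(rows, ["name"])                  # only one searchable field
large = build_index(rows, ["name", "city", "note"])  # every field searchable

print(posting_count(small), posting_count(large))
```

The smaller index can only answer questions about `name`; the larger one answers questions about any field but holds far more references. Indexing only the columns you actually need to search is therefore the usual first step in keeping an index small.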

Lastly, you mention distributing the indexes. This is almost certainly not something you want to do. The practicalities of shipping many petabytes of data around are pretty daunting. What you probably want is to have the indexes sitting on a big fat computer somewhere and to provide search as a service over the data (bring the query to the data, don't take the data to the query).
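As a rough sketch of "bring the query to the data", the toy service below keeps the rows and the index in one place and hands back only a short list of matching row ids. The `SearchService` class, its methods, and the sample rows are hypothetical and are not part of Lucene or Solr; a real deployment would put a network API in front of something like this.

```python
class SearchService:
    """Holds the data and its index together; callers send queries, not data."""

    def __init__(self, rows):
        self._rows = rows
        self._index = {}
        for row_id, row in enumerate(rows):
            # Treat each comma-separated CSV value as one searchable token.
            for token in row.lower().replace(",", " ").split():
                self._index.setdefault(token, set()).add(row_id)

    def search(self, query, limit=10):
        """Return ids of rows containing every query term (AND semantics)."""
        terms = query.lower().split()
        if not terms:
            return []
        postings = [self._index.get(term, set()) for term in terms]
        hits = set.intersection(*postings)
        return sorted(hits)[:limit]

    def fetch(self, row_id):
        """Look up a single row only after the search has narrowed things down."""
        return self._rows[row_id]


# Hypothetical sample rows standing in for the CSV files.
service = SearchService([
    "alice,paris,engineer",
    "bob,london,engineer",
    "carol,paris,designer",
])

hits = service.search("paris engineer")
print(hits, [service.fetch(i) for i in hits])
```

Only the query string goes in and only a handful of ids and rows come out, so the amount of data that moves stays small no matter how large the indexed data is.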
