Lucene and SQL Server - best practice
I am pretty new to Lucene, so I would like to get some help from you guys :)
BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text and tag searches on those documents.
Q1) In this case, in order to do keyword searches on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and another in the Lucene index)? That could be a problem, since we have a massive amount of documents (about 100GB). Is it unavoidable?
Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for tag search? If so, how should I do it?
Thanks,
Recommended answer
Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here for a brief introduction. A typical implementation is to index anything you want to be able to search on, store only a unique identifier in the Lucene index, and pull any records found by a search from the database, based on that ID. If you want to reduce DB load, you can also store in Lucene the few fields needed to display a list of search results, and query the database only to fetch the full document.
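To make the split between "index" and "storage" concrete, here is a minimal toy sketch in plain Java. It is not Lucene or SQL Server code: the class `IdOnlyIndexSketch` and its methods are hypothetical names, the `database` map merely stands in for a SQL Server table, and the `invertedIndex` map stands in for the Lucene index. The point it illustrates is that the search side only ever hands back IDs, and the full text comes from the database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: both "stores" are in-memory maps, not real Lucene or SQL Server.
class IdOnlyIndexSketch {
    // Stand-in for the SQL Server table: document ID -> full document body.
    private final Map<Integer, String> database = new HashMap<>();
    // Stand-in for the search index: term -> IDs of the documents containing it.
    private final Map<String, Set<Integer>> invertedIndex = new HashMap<>();

    // Saves the full body in the "database" and indexes its terms, keeping only the ID in the index.
    void addDocument(int id, String body) {
        database.put(id, body);
        for (String term : body.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                invertedIndex.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // A search returns document IDs only, in ascending order.
    List<Integer> searchIds(String term) {
        Set<Integer> ids = invertedIndex.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.emptySet());
        return new ArrayList<>(ids);
    }

    // The full document is fetched from the "database" by ID.
    String fetchFromDatabase(int id) {
        return database.get(id);
    }

    public static void main(String[] args) {
        IdOnlyIndexSketch sketch = new IdOnlyIndexSketch();
        sketch.addDocument(1, "Lucene builds an inverted index");
        sketch.addDocument(2, "SQL Server stores the full documents");
        for (int id : sketch.searchIds("documents")) {
            // prints "2: SQL Server stores the full documents"
            System.out.println(id + ": " + sketch.fetchFromDatabase(id));
        }
    }
}
```

In a real setup the body would be indexed but not stored in Lucene, and the ID would be the one stored field you read back from each hit before querying SQL Server.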
As for saving on space, there will be some measure of duplication. That is true even if you only use Lucene, though: Lucene keeps the inverted index used for searching entirely separate from stored data. To save space, I'd recommend being very deliberate about which data you choose to index, and which data you need to store and be able to retrieve later. What you store matters most for space in Lucene, since index-only values tend to be very space-efficient in most cases.
Lucene can certainly implement a tag search. The simple way to do it is to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense) while building the document, such as:
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));
and I could simply add a required term to any query to search only within a particular tag. For instance, if I were to search for "some stuff", but only among documents with the tag "forkids", I could write a query like:
some stuff +tags:forkids
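One detail worth knowing, as far as I understand the classic query parser with its default OR operator: in that query only `+tags:forkids` is required, while `some` and `stuff` are optional terms that affect ranking. A document tagged "forkids" that contains neither word would still match, just lower down; writing `+(some stuff) +tags:forkids` should additionally require at least one of the words. The toy sketch below models that behaviour in plain Java. It is not Lucene code: `RequiredTagSketch`, `Doc`, and `search` are hypothetical names, and the "score" is simply a count of matched optional terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of "optional terms plus one required tag"; not Lucene API.
class RequiredTagSketch {
    static final class Doc {
        final int id;
        final Set<String> words;
        final Set<String> tags;

        Doc(int id, String text, String... tags) {
            this.id = id;
            this.words = new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    // Returns IDs of documents carrying requiredTag, best match on the optional terms first.
    static List<Integer> search(List<Doc> docs, List<String> optionalTerms, String requiredTag) {
        List<int[]> hits = new ArrayList<>(); // each entry is {id, score}
        for (Doc doc : docs) {
            if (!doc.tags.contains(requiredTag)) {
                continue; // the required clause acts as a hard filter
            }
            int score = 0;
            for (String term : optionalTerms) {
                if (doc.words.contains(term)) {
                    score++; // optional clauses only raise the score
                }
            }
            hits.add(new int[] {doc.id, score});
        }
        // Highest score first; ties broken by ascending ID.
        hits.sort((a, b) -> a[1] != b[1] ? Integer.compare(b[1], a[1]) : Integer.compare(a[0], b[0]));
        List<Integer> ids = new ArrayList<>();
        for (int[] hit : hits) {
            ids.add(hit[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(1, "some stuff for children", "forkids", "widget"),
                new Doc(2, "other things", "forkids"),
                new Doc(3, "some stuff", "widget"));
        // prints "[1, 2]": doc 3 lacks the tag, doc 2 matches on the tag alone
        System.out.println(search(docs, Arrays.asList("some", "stuff"), "forkids"));
    }
}
```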