Efficient filtering/searching

2022-01-15 00:00:00 filtering search lucene mysql saas

We have a hosted application that manages pages of content. Each page can have a number of custom fields, plus some standard fields (timestamp, user name, user email, etc.).

With potentially hundreds of different sites using the system, what is an efficient way to handle filtering and searching? Picture a grid view that you want to narrow down. You can filter on specific fields (userid, date), or you can enter a full-text search.

For example, "all pages started by userid 10" would be a pretty quick query against a MySQL database. But something like "all pages started by userid 10 that also match [some search query]" would perform poorly against the database, so it is better suited to a search engine like Lucene.
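To make the contrast concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names (`pages`, `user_id`, `body`) and the sample rows are made up for illustration. The first query can be served by an index on `user_id`; the second adds a leading-wildcard `LIKE`, which has to examine the text of every candidate row.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT, body TEXT)"
)
conn.execute("CREATE INDEX idx_pages_user ON pages (user_id)")
conn.executemany(
    "INSERT INTO pages (user_id, title, body) VALUES (?, ?, ?)",
    [
        (10, "Release notes", "Notes about the spring release"),
        (10, "Team lunch", "Where to eat on Friday"),
        (11, "Release plan", "Plan for the spring release"),
    ],
)


def pages_by_user(user_id):
    """Plain field filter: can use the index on user_id."""
    rows = conn.execute(
        "SELECT id, title FROM pages WHERE user_id = ? ORDER BY id", (user_id,)
    )
    return rows.fetchall()


def pages_by_user_matching(user_id, term):
    """Field filter plus substring match: LIKE '%term%' cannot use an index on body."""
    rows = conn.execute(
        "SELECT id, title FROM pages WHERE user_id = ? AND body LIKE ? ORDER BY id",
        (user_id, f"%{term}%"),
    )
    return rows.fetchall()


print(pages_by_user(10))
print(pages_by_user_matching(10, "release"))
```

With only a handful of rows per user the second query is still cheap; the cost shows up when the substring match has to run over a large number of long text fields.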

Basically, I'm wondering how other large sites do this sort of thing. Do they use a search engine for 100% of their filtering? Or do they mix database queries with a search engine?

If we use only a search engine, there is a problem with the delay before a new or updated object appears in the search index. That is, I've read that it's not smart to update the index immediately, and that updates should be done in batches instead. Even if that means every 5 minutes, users will get confused when a page they just added isn't listed right away in a simple page listing (say, a search query of "category:5").
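The batching trade-off described above can be sketched as follows. `BatchedIndexer` and its `commit` callback are hypothetical names, not part of Lucene; the point is only that documents queued since the last flush are invisible to searches until the next flush happens.

```python
import time


class BatchedIndexer:
    """Queues documents and hands them to the index in batches.

    Hypothetical helper for illustration; `commit` stands in for whatever
    call actually writes a batch to the search index.
    """

    def __init__(self, commit, interval_seconds=300.0, clock=time.monotonic):
        self._commit = commit
        self._interval = interval_seconds
        self._clock = clock
        self._pending = []
        self._last_flush = clock()

    def add(self, doc):
        """Queue a document; flush only if the interval has elapsed."""
        self._pending.append(doc)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        """Write all queued documents in one batch."""
        if self._pending:
            self._commit(list(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()


indexed = []  # stands in for the searchable index
indexer = BatchedIndexer(indexed.extend, interval_seconds=300.0)
indexer.add({"id": 1, "category": 5})
print(indexed)  # still empty: the new page is not searchable yet
indexer.flush()
print(indexed)  # now contains the page
```

The window between `add` and the next flush is exactly the period in which a user would not see a page they just created.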

We are using MySQL and have been looking closely at Lucene for searching. Is there some other technology I don't know about?

My thought is to offer a simple filtering page that uses MySQL to filter on basic fields, and then a separate full-text search page that presents results similar to Google's. Is this the only way?

Recommended answer

Solr and grassyknoll both provide slightly more abstract interfaces to Lucene.

That said: yes. If you are a primarily content-driven site that provides full-text search over your data, you need something beyond LIKE. MySQL's FULLTEXT indexes aren't perfect, but they might be an acceptable placeholder in the interim.

Assuming you do create a Lucene index, linking Lucene Documents to your relational objects is pretty straightforward: simply add a stored property to the document at index time (this property can be a URL, ID, GUID, etc.). Searching then becomes a two-phase system:

1) Issue the query to the Lucene index (display simple results, such as the title).
2) Get more detailed information about the object from your relational store by its key.
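A minimal sketch of that two-phase flow follows. A tiny in-memory inverted index stands in for Lucene, and SQLite stands in for the relational store; `TinyIndex`, the `pages` table, and its columns are assumptions for illustration and are not Lucene's API.

```python
import sqlite3
from collections import defaultdict

# Relational store (SQLite standing in for MySQL; schema is hypothetical).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, body TEXT, user_email TEXT)"
)
PAGES = [
    (1, "Release notes", "Notes about the spring release", "ann@example.com"),
    (2, "Team lunch", "Where to eat on Friday", "bob@example.com"),
]
db.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", PAGES)


class TinyIndex:
    """Toy inverted index: term -> set of page ids, plus a stored title per id."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._stored = {}  # id -> title (the only stored display field)

    def add(self, page_id, title, body):
        self._stored[page_id] = title
        for term in f"{title} {body}".lower().split():
            self._postings[term].add(page_id)

    def search(self, query):
        """Return (id, title) for pages containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return [(i, self._stored[i]) for i in sorted(ids)]


def load_page(page_id):
    """Phase 2 helper: fetch the full row from the relational store by key."""
    return db.execute(
        "SELECT id, title, body, user_email FROM pages WHERE id = ?", (page_id,)
    ).fetchone()


index = TinyIndex()
for page_id, title, body, _email in PAGES:
    index.add(page_id, title, body)

# Phase 1: query the index and show lightweight results (id and title only).
hits = index.search("spring release")

# Phase 2: look up the details for each hit by its key.
details = [load_page(page_id) for page_id, _title in hits]
print(hits)
print(details)
```

The index holds only the key and what is needed to render a result row; everything else, such as the user's email here, stays in the database and is loaded on demand.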

Since instantiating Documents is relatively expensive in Lucene, you only want to store the fields you search on in the Lucene index, rather than complete clones of your relational objects.
