When to consider Solr

2021-12-30 00:00:00 performance solr mysql

I am working on an application that needs to do interesting things with search, including full-text search, hit highlighting, faceted search, etc.

The dataset is likely to be between 3000 and 10000 records with 20-30 fields each, all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.

All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?

Recommended answer

This question calls for a very broad answer that covers many aspects. There may well be specifics that make one system superior to another for a particular use case, but I want to cover the basics here.

I will deal entirely with Solr, as an example of several search engines that function in roughly the same way.

I want to start with some hard facts:

  • You cannot rely on Solr/Lucene as a safe primary database. There is a list of reasons why, but they mostly come down to missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source, such as an SQL table. In fact, Solr is perfect for storing documents that combine data from several tables and relations, which would otherwise require complex joins to query.

  • Solr/Lucene provides mind-blowing text analysis, stemming, full-text search scoring, and fuzzy matching: things you just cannot do with MySQL. In fact, full-text search in MySQL was limited to MyISAM before version 5.6 (InnoDB has supported FULLTEXT indexes since then), and its scoring is basic and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy, etc. range from very hard work to almost impossible.

  • In Solr/Lucene you have documents. You cannot really store and process relations. Well, you can of course index the keys of other documents inside a multivalued field of some document, so this way you can actually store 1:n relations, and do it both ways to get n:n, but it is data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, for a product catalog where you want to store the distributors of each product and search only for parts that are available at certain distributors). But you reach the end of the possibilities with HAS / HAS NOT. You can hardly do something like "get all products that are available at at least 3 distributors".

  • Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that returned 40000 hits, you can display that you would get only 3 hits if you refined your search to the combination of this field having this value and that field having that value. Things that need additional queries in MySQL are done efficiently and conveniently.
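To make the points in the list above a little more concrete, here is a minimal sketch in plain Python. It is not Solr's API, and everything in it is an assumption made for illustration: the product/distributor schema, the table and function names, and the use of `sqlite3` as a stand-in for MySQL so that the example runs anywhere. It flattens joined rows into one document per product with a multivalued `distributors` field, applies a HAS filter on that field, and counts facet values over the hits.

```python
import sqlite3
from collections import Counter

# Hypothetical schema; sqlite3 stands in for MySQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE distributors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_distributors (product_id INTEGER, distributor_id INTEGER);
    INSERT INTO products VALUES
        (1, 'brake pad', 'brakes'), (2, 'brake disc', 'brakes'), (3, 'oil filter', 'engine');
    INSERT INTO distributors VALUES (10, 'Acme'), (11, 'Globex');
    INSERT INTO product_distributors VALUES (1, 10), (1, 11), (2, 10), (3, 11);
""")


def build_documents(conn):
    """Flatten the joined tables into one document per product."""
    docs = {}
    rows = conn.execute("""
        SELECT p.id, p.name, p.category, d.name
        FROM products p
        LEFT JOIN product_distributors pd ON pd.product_id = p.id
        LEFT JOIN distributors d ON d.id = pd.distributor_id
        ORDER BY p.id, d.name
    """)
    for pid, name, category, distributor in rows:
        doc = docs.setdefault(
            pid, {"id": pid, "name": name, "category": category, "distributors": []}
        )
        if distributor is not None:
            doc["distributors"].append(distributor)  # multivalued field
    return list(docs.values())


def search(docs, text, distributor=None):
    """Naive substring match plus an optional HAS filter on the multivalued field."""
    hits = [d for d in docs if text in d["name"]]
    if distributor is not None:
        hits = [d for d in hits if distributor in d["distributors"]]
    return hits


def facet_counts(hits, field):
    """Count how many hits carry each value of a field, as a facet display would."""
    counts = Counter()
    for doc in hits:
        values = doc[field] if isinstance(doc[field], list) else [doc[field]]
        counts.update(values)
    return dict(counts)


docs = build_documents(conn)
hits = search(docs, "brake")
print(facet_counts(hits, "distributors"))
print([d["name"] for d in search(docs, "brake", distributor="Globex")])
```

The join runs once, when the documents are built, rather than on every search. That is the trade-off described above: duplicated data in exchange for simple lookups. A real search engine would also replace the substring match with analyzed, scored full-text matching.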

To sum up

  • The power of Lucene is text searching and analysis. It is also mind-blowingly fast because of the inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it is document-oriented and has no "graph querying" like triple stores have with SPARQL, basic N:M relations are possible to store and query. If your application is focused on text searching, you should definitely go for Solr/Lucene unless you have good reasons to do otherwise, such as very complex, multi-dimensional range filter queries.

If you do not have text search, but rather something where you can point and click but not enter text, good old relational databases are probably a better way to go.
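As a footnote to the speed claim in the summary, the idea behind the index structure can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: whitespace tokenization, no stemming, no scoring, and hypothetical function names. Lucene's real index stores far more, such as term frequencies and positions. The point is that a query looks terms up directly instead of scanning every record.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Solr is built on Lucene",
    2: "MySQL stores the source data",
    3: "Lucene builds an inverted index",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "lucene"))
print(search_all_terms(index, "lucene index"))
```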
