Choosing a stand-alone full-text search server: Sphinx or SOLR?

2021-11-20 00:00:00 lucene full-text-search solr mysql sphinx

I'm looking for a stand-alone full-text search server with the following properties:

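  • Must operate as a stand-alone server that can serve search requests from multiple clients
  • Must be able to do "bulk indexing" by indexing the result of an SQL query: say "SELECT id, text_to_index FROM documents;"
  • Must be free software and must run on Linux with MySQL as the database
  • Must be fast (rules out MySQL's internal full-text search)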
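
To make the bulk indexing requirement concrete, here is a minimal Python sketch of the idea: run the SQL query, then turn each row into a document the search server can ingest. It assumes a few things that are not in the question: sqlite3 stands in for MySQL so the snippet runs without a server, the sample rows are made up, and the output is shaped as a JSON array of documents of the kind Solr's JSON update handler accepts, with `text_to_index` assumed to exist as a field in the schema. Nothing is sent anywhere; the sketch only builds the payload.

```python
import json
import sqlite3


def rows_to_docs(rows):
    """Turn (id, text) rows into a list of JSON-ready documents."""
    return [{"id": str(doc_id), "text_to_index": text} for doc_id, text in rows]


# Assumption: sqlite3 is only a stand-in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text_to_index TEXT)")
conn.executemany(
    "INSERT INTO documents (id, text_to_index) VALUES (?, ?)",
    [(1, "full-text search with Solr"), (2, "Sphinx and MySQL")],  # made-up rows
)

# The query from the requirement, with an ORDER BY added for a stable result.
rows = conn.execute("SELECT id, text_to_index FROM documents ORDER BY id").fetchall()

# One JSON array holding every document: the whole table goes out in one batch.
payload = json.dumps(rows_to_docs(rows))
print(payload)
```

In practice you would not hand-roll this for either engine if it offers a built-in SQL data source; the point is only that "bulk indexing" means one query, one batch of documents.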

The alternatives I've found that have these properties are:

  • Solr (based on Lucene)
  • ElasticSearch (also based on Lucene)
  • Sphinx

My questions:

  • How do they compare?
  • Have I missed any alternatives?
  • I know that each use case is different, but are there certain cases where I would definitely not want to use a certain package?

Recommended answer

I've been using Solr successfully for almost 2 years now, and have never used Sphinx, so I'm obviously biased. However, I'll try to keep it objective by quoting the docs or other people. I'll also take patches to my answer :-)

Similarities:

  • Both Solr and Sphinx satisfy all of your requirements. They're fast and designed to index and search large bodies of data efficiently.
  • Both have a long list of high-traffic sites using them (Solr, Sphinx)
  • Both offer commercial support. (Solr, Sphinx)
  • Both offer client API bindings for several platforms/languages (Sphinx, Solr)
  • Both can be distributed to increase speed and capacity (Sphinx, Solr)

Here are some differences:

  • Solr, being an Apache project, is obviously Apache2-licensed. Sphinx is GPLv2. This means that if you ever need to embed or extend (not just "use") Sphinx in a commercial application, you'll have to buy a commercial license (rationale)
  • Solr is easily embeddable in Java applications.
  • Solr is built on top of Lucene, which is a proven technology over 8 years old with a huge user base (this is only a small part). Whenever Lucene gets a new feature or speedup, Solr gets it too. Many of the devs committing to Solr are also Lucene committers.
  • Sphinx integrates more tightly with RDBMSs, especially MySQL.
  • Solr can be integrated with Hadoop to build distributed applications
  • Solr can be integrated with Nutch to quickly build a fully-fledged web search engine with crawler.
  • Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can't.
  • Solr comes with a spell-checker out of the box.
  • Solr comes with facet support out of the box. Faceting in Sphinx takes more work.
  • Sphinx doesn't allow partial index updates for field data.
  • In Sphinx, all document ids must be unique unsigned non-zero integer numbers. Solr doesn't even require a unique key for many operations, and unique keys can be either integers or strings.
  • Solr supports field collapsing (currently as an additional patch only) to avoid duplicating similar results. Sphinx doesn't seem to provide any feature like this.
  • While Sphinx is designed to only retrieve document ids, in Solr you can directly get whole documents with pretty much any kind of data, which makes it more independent of any external data store and saves the extra roundtrip.
  • Solr, except when used embedded, runs in a Java web container such as Tomcat or Jetty, which require additional specific configuration and tuning (or you can use the included Jetty and just launch it with java -jar start.jar). Sphinx needs no such additional configuration.
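
To illustrate the "extra roundtrip" and the id constraint from the list above, here is a rough Python sketch of what the application side looks like when the search server hands back only document ids. It is a sketch under assumptions, not Sphinx client code: `ranked_ids` is a hypothetical list standing in for whatever the search server returned, sqlite3 stands in for MySQL, the sample rows are invented, and `is_valid_sphinx_id` and `fetch_documents` are illustrative helper names.

```python
import sqlite3


def is_valid_sphinx_id(doc_id):
    """Check the constraint quoted above: a non-zero, unsigned integer."""
    return isinstance(doc_id, int) and not isinstance(doc_id, bool) and doc_id > 0


def fetch_documents(conn, ranked_ids):
    """Second roundtrip: load the rows for the matched ids, keeping rank order."""
    if not ranked_ids:
        return []
    placeholders = ",".join("?" for _ in ranked_ids)
    rows = conn.execute(
        f"SELECT id, text_to_index FROM documents WHERE id IN ({placeholders})",
        ranked_ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not preserve order, so re-apply the search ranking here.
    return [by_id[i] for i in ranked_ids if i in by_id]


# Assumption: sqlite3 stands in for MySQL; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text_to_index TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(1, "alpha"), (2, "beta"), (3, "gamma")],
)

ranked_ids = [3, 1]  # hypothetical search result: ids only, best match first
results = fetch_documents(conn, ranked_ids)
print(results)
```

With a server that returns stored fields directly, the `fetch_documents` step goes away; whether that matters depends on how cheap the database lookup is in your setup.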

Related questions:

  • Full-text searching with Rails
  • Comparison of full-text search engines - Lucene, Sphinx, Postgresql, MySQL?
