Indexing multiple languages with Solr

2022-01-15 00:00:00 lucene solr java

We're setting up a Solr instance to index documents whose title field can be in various languages. After some googling I found two options:

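  1. Define a different schema field for every language, i.e. title_en, title_fr, ..., apply different filters to each language, then query the title field that corresponds to the language.
  2. Create a different Solr core to handle each language and make our app query the correct Solr core.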

Which one is better? What are the pros and cons of each?

Thanks

Recommended answer

There's also a third alternative, where you use a common set of fields for all languages but apply a filter on a language field. For instance, if you have the fields text and language, you can put the text content for all languages into the text field and use e.g. fq=language:english to retrieve only English documents.

The downside of this approach is that you cannot use language-specific features such as lemmatisation, stemming, etc.
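As a rough illustration of the shared-field approach, here is a minimal sketch that builds the request URL using only the Python standard library. The host, the core name docs, and the helper name are assumptions made for the example; only the text and language fields and the fq=language:english filter come from the description above. It also does not escape Solr query syntax in the user input, which a real application would need to do.

```python
from urllib.parse import urlencode


def build_filtered_query(core_url, user_query, language):
    """Search the shared `text` field and restrict hits by language.

    `core_url` is a hypothetical core address such as
    http://localhost:8983/solr/docs (not taken from the question).
    """
    params = [
        # Main query against the single field that holds every language.
        ("q", f"text:({user_query})"),
        # Filter query: only documents tagged with the requested language.
        ("fq", f"language:{language}"),
    ]
    return f"{core_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_filtered_query(
        "http://localhost:8983/solr/docs", "solr indexing", "english"))
```

Because the language restriction lives in fq rather than q, it does not affect relevance scoring and Solr can cache it independently of the main query.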

Define a different schema field for every language, i.e. title_en, title_fr, ..., apply different filters to each language, then query the title field that corresponds to the language.

This approach gives good flexibility, but beware of high memory consumption and complexity when many languages are present. This can be mitigated by using multiple Solr servers.
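On the application side, the per-language-field approach mostly comes down to picking the right field name at query time. A minimal sketch, assuming a hypothetical set of supported language codes and the title_<lang> naming from the question (the helper names are mine, and the schema itself still has to define an analyzer for each field):

```python
from urllib.parse import urlencode

# Hypothetical list of languages that have a title_<lang> field in the schema.
SUPPORTED_LANGUAGES = {"en", "fr", "de"}


def title_field(language_code):
    """Return the per-language title field, e.g. 'fr' -> 'title_fr'."""
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"no title field configured for {language_code!r}")
    return f"title_{language_code}"


def build_title_query(user_query, language_code):
    """Build the URL-encoded q parameter targeting one language's field."""
    field = title_field(language_code)
    return urlencode({"q": f"{field}:({user_query})"})


if __name__ == "__main__":
    print(build_title_query("bonjour", "fr"))
```

Failing loudly on an unknown language code is a design choice for the sketch; falling back to a default field would be equally reasonable.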

Create a different Solr core to handle each language and make our app query the correct Solr core.

Definitely a nice solution. But whether the separate administration and slight overhead will work for you probably depends on the number of languages you wish to support.
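With one core per language, the routing moves from the field name into the URL. A minimal sketch, assuming hypothetical core names (docs_en, docs_fr) and a single title field inside each core; none of these names come from the question:

```python
from urllib.parse import urlencode

# Hypothetical mapping from language code to the core that indexes it.
CORE_BY_LANGUAGE = {
    "en": "docs_en",
    "fr": "docs_fr",
}


def select_url(solr_root, language_code, user_query):
    """Return the select URL on the core responsible for `language_code`."""
    # Raises KeyError if no core is configured for the language.
    core = CORE_BY_LANGUAGE[language_code]
    query_string = urlencode({"q": f"title:({user_query})"})
    return f"{solr_root}/{core}/select?{query_string}"


if __name__ == "__main__":
    print(select_url("http://localhost:8983/solr", "fr", "bonjour"))
```

Each core can then carry its own schema, so the field can keep the same name everywhere while the analysis chain differs per language.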

Unless the first approach is applicable, I would probably lean towards the second one, unless the scalability of cores isn't desired. Either approach is fine though, and I think it basically comes down to preference.
