什么是匹配两个包含少于 10 个拉丁文单词的字符串的最佳算法

2022-01-15 00:00:00 similarity string-matching lucene java

我正在比较歌曲标题,使用拉丁脚本(尽管并非总是如此),我的目标是一种算法,如果两首歌曲标题似乎是相同的标题,则给出高分,如果它们什么都没有,则给出非常低的分数共同点.

I'm comparing song titles, using Latin script (although not always), my aim is an algorithm that gives a high score if the two song titles seem to be the same same title and a very low score if they have nothing in common.

现在我已经不得不使用 Lucene 和 RAMDirectory 编写代码 (Java) 来编写这个 - 但是仅使用 Lucene 来比较两个字符串太重,因此太慢了.我现在开始使用 https://github.com/nickmancol/simmetrics ,它有很多不错的算法比较两个字符串:

Now I already had to code (Java) to write this using Lucene and a RAMDirectory - however using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to using https://github.com/nickmancol/simmetrics which has many nice algorithms for comparing two strings:

https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics

BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex

但我对这些算法并不精通,什么是好的选择?

but I'm not well versed in these algorithms and what would be a good choice ?

我认为 Lucene 以某种形式使用 CosineSimilarity,所以这是我的出发点,但我认为可能会有更好的东西.

I think Lucene uses CosineSimilarity in some form, so that is my starting point but I think there might be something better.

具体来说,该算法应该适用于短字符串,并且应该理解单词的概念,即应该对空格进行特殊处理.拉丁文字的良好匹配是最重要的,但其他文字(如韩文和中文)的良好匹配也是相关的,但我希望由于它们处理空格的方式而需要不同的算法.

Specifically, the algorithm should work on short strings and should understand the concept of words, i.e spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well but I expect would need different algorithm because of the way they treat spaces.

推荐答案

他们都很好.它们作用于字符串的不同属性并具有不同的匹配属性.最适合您的取决于您的需求.

They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.

我正在使用 JaccardSimilarity 来匹配名称.我选择 JaccardSimilarity 是因为它相当快,而且对于短字符串而言,它擅长将名称与常见的错字匹配,同时迅速降低其他任何东西的分数.给空间额外的重量.它也对词序不敏感.我需要这种行为,因为误报的影响比误报的影响要大得多,空格可能是拼写错误,但不常见,而且词序并不那么重要.

I'm using the JaccardSimilarity to match names. I chose the JaccardSimilarity because it was reasonably fast and for short strings excelled in matching names with common typo's while quickly degrading the score for anything else. Gives extra weight to spaces. It is also insensitive to word order. I needed this behavior because the impact of a false positive was much much higher then that off a false negative, spaces could be typos but not often and word order was not that important.

请注意,这是结合删除非变音符号的简化器和将剩余字符映射到 a-z 范围的映射器完成的.这是通过将所有单词分隔符标准化为单个空格的规范化传递的.最后,解析名称以挑选出首字母、前缀和后缀.这是因为名称具有一种结构和格式,它对字符串比较具有相当的抵抗力.

Note that this was done in combination with a simplifier that removes non-diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizes that standardizes all word separator symbols to a single space. Finally the names are parsed to pick out initials, pre- inner- and suffixes. This because names have a structure and format to them that is rather resistant to just string comparison.

要做出选择,您需要列出您想要的标准,然后寻找满足这些标准的算法.您还可以创建一个相当大的测试集并在该测试集上运行所有算法,看看在时间、阳性数、假阳性、假阴性和阴性数方面的权衡是什么,您的系统应该处理的错误类别,等等,等等.

To make your choice you need to make a list of what criteria you want and then look for an algorithm that satisfied those criteria. You can also make a reasonably large test set and run all algorithms on that test set too see what the trade offs are with respect to time, number of positives, false positives, false negatives and negatives, the classes of errors your system should handle, ect, ect.

如果您仍然不确定自己的选择,您还可以设置系统以在运行时切换精确的比较算法.这使您可以进行 A-B 测试并查看哪种算法在实践中效果最佳.

If you are still unsure of your choice, you can also setup your system to switch the exact comparison algorithms at run time. This allows you to do an A-B test and see which algorithm works best in practice.

TLDR;您想要哪种算法取决于您需要什么,如果您不知道自己需要什么,请确保您可以稍后更改它并即时运行测试.

TLDR; which algorithm you want depends on what you need, if you don't know what you need make sure you can change it later on and run tests on the fly.

相关文章