为什么在切换要比较的字符串顺序时,Difference 函数会给出不同的结果?

2021-09-10 00:00:00 tsql sql-server

在 SQL Server 中,如果我执行以下操作:

In SQL Server, if I do the following:

Difference ('Kennady', 'Kary') : I get 2

如果我这样做:

Difference ('Kary', 'Kennady') : I get 3.

我认为差异函数会查看底层的 Soundex 值,并给出一个 0-4 的数字,表示有多少字符是相同的.

I thought the Difference function looks at the Soundex values under the hood, and gives a 0-4 number of how many characters in place are the same.

SELECT SOUNDEX('Kennady') AS [SoundEx Kennady]
    , SOUNDEX('Kary') AS [SoundEx Kary]
    , DIFFERENCE ('Kennady', 'Kary') AS [Difference Kennady vs Kary]
    , DIFFERENCE ('Kary', 'Kennady') AS [Difference Kary vs Kennady];

推荐答案

这是严格的观察.文档 非常清楚:

This is strictly observational. The documentation is pretty clear:

返回的整数是 SOUNDEX 值中的字符数那是一样的.返回值范围从 0 到 4:0表示弱相似或无相似,4 表示强相似或相同的值.

The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.

根据本文档,返回值不应因参数的顺序而异.

According to this documentation, the return value should not differ based on the order of the arguments.

来自我的查询:Kennady"--> K530 和Kary"--> K600.它们有两个共同的字符,所以值应该是 2.

From my queries: "Kennady" --> K530 and "Kary" --> K600. These have two characters in common, so the value should be 2.

现在,我注意到Kenn"--> K500.将Kennady"截断为Kary"的长度会得到值3".嗯.

Now, I notice that "Kenn" --> K500. Truncating "Kennady" to the length of "Kary" results in the value "3". Hmmm.

因此,我认为 DIFFERENCE() 是使用第一个参数的长度来截断第二个参数.这使得参数的顺序很重要.先把较长的论点放在首位.

Hence, I think that DIFFERENCE() is using the length of the first argument to truncate the second argument. That makes the order of the arguments important. Put the longer argument first.

我在其他一些字符串上试过这个.相同的模式似乎有效.我还没有找到任何说明这种情况的文件.

I tried this out on some other strings. The same patterns seems to work. I have not found any documentation that specifies that this is the case.

我想微软会称其为功能"而不是错误";)

I suppose Microsoft would call this a "feature" and not a "bug" ;).

以上推测并不完全正确.考虑以下

The above speculation is not quite correct. Consider the following

  • leepaupauld --> L114
  • 利奥波德 --> L143
  • leepaup --> L110

然而,

  • difference(leepaupauld, leopold) = 4 (!)
  • 差异(利奥波德,利波保德)= 3
  • difference(leepaup, leopold) = 3 (!)
  • 差异(利奥波德,利帕普)= 2

考虑到字符串的 soundex 值,(!) 是我的判断,即结果根本没有意义.

The (!) is my judgement that the result makes no sense at all, given the soundex values for the strings.

所以,问题不在于长度.这是@jpw 在评论中指向的底层方法.问题似乎是一个字符串中的重复匹配值.但是,根据文档,这些不应该多次匹配同一个字符.

So, the issue isn't the length. It is the underlying method, which @jpw points to in the comment. The problem appears to be duplicate matching values in one string. However, according to the documentation, these should not match the same character multiple times.

我的建议:使用 Levenshtein 距离.这说得通.它在更长的字符串上效果更好.这是理智的.它不是内置的,但很容易在网络上找到任何数据库的实现.

My advice: Use Levenshtein distance. It makes sense. It works better on longer strings. It is sane. It is not built in, but it is easy enough to find an implementation on the web for any database.

相关文章