SQL 模糊匹配

2021-12-13 00:00:00 full-text-search fuzzy-search sql-server-2008 sql-server

希望我不要重复这个问题.在这里发帖之前，我在这里和谷歌做了一些搜索.

Hope i am not repeating this question. I did some search here and google before posting here.

我正在使用启用全文的 SQL Server 2008R2 运行电子商店.

I am running a eStore with SQL Server 2008R2 with Full Text enabled.

有一个产品表，其中包含产品名称、OEM 代码、该产品适合的型号.都是文字.
我创建了一个名为 TextSearch 的新列.这已将产品名称、OEM 代码和该产品适用的型号的值串联在一起.这些值以逗号分隔.
当客户输入关键字时，我们会在 TextSearch 列上运行搜索以匹配产品.请参阅下面的匹配逻辑.

我正在使用混合全文和正常的搜索.这给出了更相关的结果.返回在临时表中执行的所有查询和不同的查询.

I am using a Hybrid Fulltext and normal like to do search. This gives more relevant results. All the queries executed in to a temp table and distincts were returned.

匹配逻辑，

运行以下 SQL 以使用全文获取相关产品.但是@Keywords 将被预处理.假设CLC 2200"将更改为CLC* AND 2200*"

Run following SQL to get relevant product using full text. But @Keywords will be pre-processed. Say 'CLC 2200' will be changed to 'CLC* AND 2200*'

SELECT Id FROM dbo.Product WHERE CONTAINS (TextSearch ,@Keywords)

SELECT Id FROM dbo.Product WHERE CONTAINS (TextSearch ,@Keywords)

另一个查询将使用正常方式运行.因此，CLC 2200"将被预处理为TextSearch like %clc% AND TextSearch like %2200%".这仅仅是因为全文搜索不会在关键字之前搜索模式.例如，它不会返回pclc 2200".

Another query will be running using normal like. So 'CLC 2200' will be pre-processed to 'TextSearch like %clc% AND TextSearch like %2200%'. This is simply because full text search wont search patterns before the keywords. example, it wont return 'pclc 2200'.

SELECT Id FROM dbo.Product WHERE TextSearch like '%clc%' AND TextSearch like '%2200%'

SELECT Id FROM dbo.Product WHERE TextSearch like '%clc%' AND TextSearch like '%2200%'

如果第 1 步和第 2 步没有返回任何记录，则将执行以下搜索.我对值 135 进行了微调以返回更多相关记录.

If step 1 and 2 didn't return any records, following search will be executed. Value 135 was fine tuned by me to return more relevant records.

SELECT p.id FROM dbo.Product AS p INNER JOIN FREETEXTTABLE(product,TextSearch,@Keywords) AS r ON p.Id = r.[KEY] WHERE r.RANK > 135

SELECT p.id FROM dbo.Product AS p INNER JOIN FREETEXTTABLE(product,TextSearch,@Keywords) AS r ON p.Id = r.[KEY] WHERE r.RANK > 135

以上所有组合都以合理的速度运行良好，并返回关键字的相关产品.

All of above combined works fine in a reasonable speed and returns relevant products for keywords.

但我希望在根本找不到任何产品的情况下进一步改进.

But i am looking for to further improve when there is no product found at all.

假设客户寻找CLC 2200npk"而该产品不存在，我需要在CLC 2200"附近展示下一个.

Say if customer looks for 'CLC 2200npk' and this product wasn't there, i needed to show next very close by 'CLC 2200'.

到目前为止，我尝试使用 Soundex() 函数.购买计算TextSearch 列中每个单词的soundex 值并与关键字的soudex 值进行比较.但这会返回太多记录，而且速度也很慢.

So far i tried using Soundex() function. Buy computing soundex value for each word in TextSearch column and comparing with soudex value of keyword. But this returns way too many records and slow too.

例如，CLC 2200npk"将返回CLC 1100"等产品.但这不会是一个好的结果.由于它不接近 CLC 2200npk

example, 'CLC 2200npk' will return products such as 'CLC 1100' etc. But this wouldn't be a good result. As it is not close to CLC 2200npk

这里还有一个不错的选择.但这使用了 CLR 函数.但是我无法在服务器上安装 CLR 功能.

There is another good one here. but this uses CLR Functions. But i can not install CLR functions on the server.

所以我的逻辑应该是，

如果未找到CLC 2200npk"，则在CLC 2200"附近显示如果未找到CLC 2200"，则显示CLC 1100"附近的下一个

if 'CLC 2200npk' not found, show close by 'CLC 2200' if 'CLC 2200' not found, show next close by 'CLC 1100'

是否可以按照建议进行匹配?
如果我需要进行拼写更正和搜索，有什么好方法?我们所有的产品列表都是英文的.
是否有任何 UDF 或 SP 来匹配我的建议之类的文本?

谢谢.

推荐答案

一个相当快速的领域特定解决方案可能是使用 SOUNDEX 和 2 个字符串之间的数字距离来计算字符串相似度.只有当您拥有大量产品代码时，这才会真正有用.

A rather quick domain specific solution may be to calculate a string similarity using SOUNDEX and a numeric distance between 2 strings. This will only really help when you have a lot of product codes.

使用像下面这样的简单 UDF，您可以从字符串中提取数字字符，这样您就可以从CLC 2200npk"中获取 2200 个字符，从CLC 1100"中获取 1100 个字符，因此您现在可以根据 SOUNDEX 输出确定接近度每个输入以及每个输入的数字分量的接近度.

Using a simple UDF like below you can extract the numeric chars from a string so that you can then get 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100' so you can now determine closeness based on the SOUNDEX output of each input as well as closeness of the numeric component of each input.

CREATE Function [dbo].[ExtractNumeric](@input VARCHAR(1000)) RETURNS INT AS BEGIN WHILE PATINDEX('%[^0-9]%', @input) > 0 BEGIN SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '') END IF @input = '' OR @input IS NULL SET @input = '0' RETURN CAST(@input AS INT) END GO

就通用算法而言，有几种可能会根据数据集大小和性能要求帮助您取得不同程度的成功.(两个链接都有可用的 TSQL 实现)

As far as general purpose algorithms go there are a couple which might help you with varying degrees of success depending on data set size and performance requirements. (both links have TSQL implementations available)

Double Metaphone - 这种算法会给你一个比 soundex 更好的匹配，代价是虽然速度很快，但对于拼写纠正来说确实很有用.
Levenshtein 距离 - 这将计算如何将一个字符串转换为另一个字符串需要多次按键，例如从CLC 2200npk"到CLC 2200"是 3，而从CLC 2200npk"到CLC 1100"是 5.

Double Metaphone - This algo will give you a better match than soundex at the cost of speed it is really good for spelling correction though.

Levenshtein Distance - This will calculate how many keypresses it would take to turn one string into another for instance to get from 'CLC 2200npk' to 'CLC 2200' is 3, while from 'CLC 2200npk' to 'CLC 1100' is 5.

这里是一篇有趣的文章，它同时应用了这两种算法可能会给你一些想法.

Here is an interesting article which applies both algos together which may give you a few ideas.

希望其中的一些帮助有点小.

Well hopefully some of that helps a little.

这里是一个更快的部分 Levenshtein 距离实现(阅读帖子它不会返回与正常结果完全相同的结果).在我的 125000 行测试表上，它在 6 秒内运行，而我链接到的第一个表则为 60 秒.

Here is a much faster partial Levenshtein Distance implementation (read the post it wont return exact same results as the normal one). On my test table of 125000 rows it runs in 6 seconds compared to 60 seconds for the first one I linked to.

相关文章