SQL 函数 - 使用 Levenshtein 距离算法进行模糊匹配 - 仅返回最低值
问题:需要 SQL 函数使用 Levenshtein 算法返回最低"匹配值.
代码:
<预><代码>创建函数 ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))返回整数作为开始声明 @s1_len int, @s2_len int声明 @i int、@j int、@s1_char nchar、@c int、@c_temp int声明@cv0 varbinary(8000), @cv1 varbinary(8000)选择@s1_len = LEN(@s1),@s2_len = LEN(@s2),@cv1 = 0x0000,@j = 1,@i = 1,@c = 0而@j <= @s2_lenSELECT @cv1 = @cv1 + CAST(@j AS binary(2)),@j = @j + 1当@i <= @s1_len开始选择@s1_char = SUBSTRING(@s1, @i, 1),@c = @i,@cv0 = CAST(@i AS binary(2)),@j = 1当@j <= @s2_len开始设置@c = @c + 1SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +当@s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END如果@c >@c_temp SET @c = @c_tempSET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1如果@c >@c_temp SET @c = @c_tempSELECT @cv0 = @cv0 + CAST(@c AS binary(2)),@j = @j + 1结尾选择@cv1 = @cv0,@i = @i + 1结尾返回@c结尾如果 OBJECT_ID('tempdb..#ExistingCustomers') 不是 NULL删除表#ExistingCustomers;创建表#ExistingCustomers(客户 VARCHAR(255),身份证号码)INSERT #ExistingCustomers SELECT 'Ed''s Barbershop', 1002INSERT #ExistingCustomers SELECT 'GroceryTown', 1003INSERT #ExistingCustomers SELECT 'Candy Place', 1004INSERT #ExistingCustomers SELECT 'Handy Man', 1005如果 OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') 不是 NULL下降表#潜在客户;创建表#POTENTIALCUSTOMERS(客户VARCHAR(255));插入 #POTENTIALCUSTOMERS SELECT 'Eds Barbershop'INSERT #POTENTIALCUSTOMERS SELECT '杂货城'插入 #POTENTIALCUSTOMERS 选择糖果店"INSERT #POTENTIALCUSTOMERS SELECT 'Handee Man'插入 #POTENTIALCUSTOMERS 选择苹果农场"插入 #POTENTIALCUSTOMERS SELECT 'Ride-a-Long Bikes'选择 A. 客户,出价,b.客户作为客户,dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) 作为 ValueLev来自#POTENTIALCUSTOMERS aLEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) <15;返回:
我想退货:
说明:结果是 Levenshtein 算法的最低"值.有两行 Levenshtein 分数相同 The Apple Farm
和 Ride-a-Long Bikes
,在这种情况下,任何值都可以,只要它是一种价值.
参考资料:
SQL 模糊连接 - MSSQL
http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
解决方案如果您按潜在客户进行分区并使用 ValueLev 对结果进行排序,则可以使用 CTE 来获得您想要的结果:
;with CTE AS(SELECT RANK() OVER (PARTITION BY a.Customer ORDER BY dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) ASC) AS RowNbr,一个客户,出价,b.客户作为客户,dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) 作为 ValueLev来自#POTENTIALCUSTOMERS aLEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) <15)选择客户,MIN(ID) 作为 ID,MIN(cust) AS cust,价值等级从 CTE哪里 CTE.RowNbr = 1按客户分组,ValueLev
由于您不介意在重复 ValueLev
的情况下返回哪个结果,请使用 GROUP BY
和 MIN
来缩放结果每个潜在客户最多一个.
输出:
Customer ID cust ValueLev糖果店 1004 糖果店 0杂货镇 1003 杂货镇 0Eds 理发店 1002 Ed 的理发店 1勤杂工 1005 勤杂工 2苹果农场 1004 Candy Place 9Ride-a-Long Bikes 1003 Candy Place 14
Problem: Need SQL function to return the 'lowest' matching value using the Levenshtein algorithm.
Code:
CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int
DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT
@s1_len = LEN(@s1),
@s2_len = LEN(@s2),
@cv1 = 0x0000,
@j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT
@s1_char = SUBSTRING(@s1, @i, 1),
@c = @i,
@cv0 = CAST(@i AS binary(2)),
@j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END
IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
DROP TABLE #ExistingCustomers;
CREATE TABLE #ExistingCustomers
(
Customer VARCHAR(255),
ID INT
)
INSERT #ExistingCustomers SELECT 'Ed''s Barbershop', 1002
INSERT #ExistingCustomers SELECT 'GroceryTown', 1003
INSERT #ExistingCustomers SELECT 'Candy Place', 1004
INSERT #ExistingCustomers SELECT 'Handy Man', 1005
IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
DROP TABLE #POTENTIALCUSTOMERS;
CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));
INSERT #POTENTIALCUSTOMERS SELECT 'Eds Barbershop'
INSERT #POTENTIALCUSTOMERS SELECT 'Grocery Town'
INSERT #POTENTIALCUSTOMERS SELECT 'Candy Place'
INSERT #POTENTIALCUSTOMERS SELECT 'Handee Man'
INSERT #POTENTIALCUSTOMERS SELECT 'The Apple Farm'
INSERT #POTENTIALCUSTOMERS SELECT 'Ride-a-Long Bikes'
SELECT A.Customer,
b.ID,
b.Customer as cust,
dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) as ValueLev
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 15;
This returns:
What I would like to return:
Explanation: The results are the 'lowest' values from the Levenshtein algorithm. There are two rows where the Levenshtein scores are the same The Apple Farm
and Ride-a-Long Bikes
, in which case any of the values is fine, just as long as it is one value.
References:
SQL Fuzzy Join - MSSQL
http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
解决方案You can use CTE to get the result you want if you partition by the potential customer and use the ValueLev to order the results:
;WITH CTE AS
(
SELECT RANK() OVER (PARTITION BY a.Customer ORDER BY dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) ASC) AS RowNbr,
A.Customer,
b.ID,
b.Customer as cust,
dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) as ValueLev
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 15
)
SELECT Customer,
MIN(ID) AS ID,
MIN(cust) AS cust,
ValueLev
FROM CTE
WHERE CTE.RowNbr = 1
GROUP BY Customer, ValueLev
As you don't mind which result is returned in the case of duplicate ValueLev
, use GROUP BY
and MIN
to scale the results down to one per potential customer.
Output:
Customer ID cust ValueLev
Candy Place 1004 Candy Place 0
Grocery Town 1003 GroceryTown 0
Eds Barbershop 1002 Ed's Barbershop 1
Handee Man 1005 Handy Man 2
The Apple Farm 1004 Candy Place 9
Ride-a-Long Bikes 1003 Candy Place 14
相关文章