使用 T-SQL 查找近似重复的数据库记录?

2021-09-10 00:00:00 tsql sql-server

大家好.我有一个包含相当多行的 MSSQL 2008 数据库.到目前为止,在将新行插入表中之前,存储过程会检查该记录是否已存在于数据库中(通过检查标记为 Title 的列).这个检查是精确的,如果要插入的记录略有不同,它会插入它而不是更新现有行(这是一个近似匹配).我想做的是在插入之前以某种方式检测表中的近似重复.所以要插入的新记录:

Hey all. I have a MSSQL 2008 database with a fair number of rows. As of now, before new rows are inserted into the table, the stored procedure checks to see if that record already exists in the database (by checking a column labeled Title). This check is exact, and if the to-be-inserted record is slightly different, it will insert it instead of updating the existing row (which is an approximate match). What I would like to do is somehow detect approximate duplications in the table before inserting. So a new record that is to be inserted:

The quick brown fox jumps over the lazy dog

大致匹配:

Quick brown fox jumps over the lazy dog

如果该记录已经存在于表中.我已经看到(并用于其他情况)在 T-SQL 中实现的 Levenshtein Distance 算法,但我不确定这是否适用于我的情况,因为执行算法需要一对输入字符串.社区成员如何处理此类事情?谢谢.

if this record exists in the table already. I've seen (and used for other situations) the Levenshtein Distance algorithm implemented in T-SQL, but I'm not sure if this could be applied in my case because a pair of input strings are required to execute the algorithm. How are members of the community handing things of this sort? Thanks.

推荐答案

全文搜索是您最好的选择.由于需要大量的计算,在任何非平凡大小的文本语料库上使用 Levenshtein 很快就会出现问题.对于基于字符的差异而不是基于单词的差异,更常见的是使用 LD/SOUNDEX 等.假设单词至少拼写正确,FTS 会更合适.我还可以想象一种使用 FTS 来识别可能的匹配候选者的两层方法,并在过滤后的集合上执行更细粒度的匹配.如果你真的想去城里,那么搜索文本的最佳结构之一是 Trie,但这在表中实现起来很棘手,并且作为内存中的数据结构效果更好.基于单词的 n-gram 解决方案也可能值得研究.

Full-Text Search is your best bet here. Using Levenshtein on any non-trivial sized corpus of text soon becomes problematic due to the computational grunt required. It's more common to use LD/SOUNDEX etc for character based discrepancies rather than word based discrepancies. Assuming words are at minimum correctly spelled, FTS would be a better fit. I can also imagine a two-tiered approach using FTS to identify likely match candidates, with finer grained matching performed over the filtered set. If you really want to go to town, then one of the best performing structures for searching text is the Trie, but this is tricky to implement in tables, and works better as an in-memory data-structure. A word based n-gram solution might also be worth investigating.

相关文章