Matching Oracle duplicate column values using Soundex, Jaro Winkler and Edit Distance (UTL_MATCH)
I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues, which I am also trying to fix, but until I have the go-ahead to do so I am stuck with the data I have.
The table columns I have available are:
SURNAME VARCHAR2(43)
FORENAME VARCHAR2(38)
BIRTH_DATE DATE
ADDRESS_LINE1 VARCHAR2(60)
ADDRESS_LINE2 VARCHAR2(60)
ADDRESS_LINE3 VARCHAR2(60)
ADDRESS_LINE4 VARCHAR2(60)
ADDRESS_LINE5 VARCHAR2(60)
POSTCODE VARCHAR2(15)
The SOUNDEX function is relatively limited for this use, but the UTL_MATCH package seems to offer a better level of matching using the Jaro Winkler algorithm.
Rather than re-inventing the wheel, has anyone implemented a reliable method for matching this type of data?
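Data quality issues to contend with:
- Postcodes, although mandatory, are not always fully entered.
- Address data is of relatively poor quality, with addresses entered in no fixed format (i.e. some may have line1 as "Flat 1" while others may have line1 as "Flat1, 22 Acacia Ave").
- The forename column can contain an initial, the full forename, or sometimes more than one forename.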
For example, I am considering:
Concatenating all address fields and applying the Jaro Winkler algorithm to the full address, combined with a similar test of the full name concatenated together.
The birth date can be compared directly for a match, but due to the large volume of data, matching on this alone isn't enough.
Oracle 10g R2 Enterprise Edition.
Any useful suggestions welcome.
Recommended answer
"I am trying to find a reliable method for matching duplicate person records within the database."
Alas, there is no such thing. The most you can hope for is a system with a reasonable element of doubt.
SQL> select n1
          , n2
          , soundex(n1) as sdx_n1
          , soundex(n2) as sdx_n2
          , utl_match.edit_distance_similarity(n1, n2) as ed
          , utl_match.jaro_winkler_similarity(n1, n2) as jw
     from t94
     order by n1, n2
     /

N1                   N2                   SDX_ SDX_         ED         JW
-------------------- -------------------- ---- ---- ---------- ----------
MARK                 MARKIE               M620 M620         67         93
MARK                 MARKS                M620 M620         80         96
MARK                 MARKUS               M620 M622         67         93
MARKY                MARKIE               M620 M620         67         89
MARSK                MARKS                M620 M620         60         95
MARX                 AMRX                 M620 A562         50         91
MARX                 M4RX                 M620 M620         75         85
MARX                 MARKS                M620 M620         60         84
MARX                 MARSK                M620 M620         60         84
MARX                 MAX                  M620 M200         75         93
MARX                 MRX                  M620 M620         75         92

11 rows selected.
The big advantage of SOUNDEX is that it tokenizes the string. This means it gives you something which can be indexed, and that is incredibly valuable when it comes to large amounts of data. On the other hand, it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implementations of them via Google.
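As a rough illustration of what "can be indexed" means here, the sketch below builds a function-based index on the SOUNDEX code of the surname. The table name PERSON and the index name are assumptions made for the example (the question only lists the columns), so treat this as a starting point rather than a tested design.

```sql
-- PERSON is a hypothetical table name; SURNAME, FORENAME and BIRTH_DATE
-- are the columns listed in the question.
create index person_surname_sdx_ix
    on person (soundex(surname));

-- An equality predicate on the same expression is a candidate for the index,
-- so the search does not have to compute SOUNDEX for every row at query time.
select surname
     , forename
     , birth_date
  from person
 where soundex(surname) = soundex('MARX');
```

Going by the codes in the output above, a lookup like this would pick up MARKS, MARSK and MRX (all M620) but would miss AMRX (A562) and MAX (M200), which is the crudeness referred to.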
The advantage of scoring is that it allows for a degree of fuzziness, so you can find all rows where name_score >= 90%. The crushing disadvantage is that the scores are relative: a score only exists for a pair of strings, so you cannot index them. This sort of comparison kills you with large volumes.
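To make the cost concrete, here is a minimal sketch of a threshold query over surnames. PERSON and PERSON_ID are hypothetical names (the question does not give a table name or key). Because the score depends on both rows, the function has to be evaluated for every candidate pair, and no index on either side can help.

```sql
-- PERSON and PERSON_ID are assumed names, used for illustration only.
-- The join condition keeps each unordered pair once and skips self-pairs.
select a.person_id as id_a
     , b.person_id as id_b
     , utl_match.jaro_winkler_similarity(a.surname, b.surname) as jw
  from person a
  join person b
    on b.person_id > a.person_id
 where utl_match.jaro_winkler_similarity(a.surname, b.surname) >= 90;
```

With n rows this scores roughly n*(n-1)/2 pairs, which is why an unrestricted self-join of this kind is only workable on small sets.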
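What this means is:
- You need a mix of strategies. No single algorithm will solve your problem.
- Data cleansing is useful. Compare the scores for MARX vs MRX and MARX vs M4RX: stripping numbers from names improves the hit rate.
- You cannot score big volumes of names on the fly. Use tokenizing and pre-scoring if you can. Use caching if you don't have a lot of churn. Use partitioning if you can afford it.
- Use Oracle Text (or similar) to build a thesaurus of nicknames and variants.
- Oracle 11g introduced specific name search functionality to Oracle Text.
- Build a table of canonical names for scoring and link the actual data records to it.
- Use other data values, especially indexable ones such as date of birth, to pre-filter large volumes of names or to increase confidence in proposed matches.
- Be aware that other data values come with their own problems: is someone born on 31/01/11 eleven months old or eighty years old?
- Remember that names are tricky, especially when you have to consider names which have been romanized: there are over four hundred different ways of spelling Moammar Khadaffi (in the roman alphabet) - and not even Google can agree on which variant is the most canonical.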
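Two of the points above, cleansing and pre-filtering, can be combined in one query. The sketch below is only an outline under stated assumptions: PERSON and PERSON_ID are hypothetical names, BIRTH_DATE is assumed to hold no time component, and the threshold of 90 is arbitrary. Pairs are restricted to rows sharing a birth date (an indexable equality) before any scoring happens, and digits are stripped from the surnames so that a value like M4RX is scored as MRX.

```sql
-- PERSON and PERSON_ID are assumed names, used for illustration only.
select a.person_id as id_a
     , b.person_id as id_b
     , utl_match.jaro_winkler_similarity(
           regexp_replace(a.surname, '[0-9]')    -- cleansing: drop digits
         , regexp_replace(b.surname, '[0-9]')
       ) as jw_surname
  from person a
  join person b
    on b.birth_date = a.birth_date               -- pre-filter on an indexable value
   and b.person_id  > a.person_id                -- each pair once, no self-pairs
 where utl_match.jaro_winkler_similarity(
           regexp_replace(a.surname, '[0-9]')
         , regexp_replace(b.surname, '[0-9]')
       ) >= 90;
```

In the output shown earlier, MARX vs M4RX scores 85 while MARX vs MRX scores 92, so with this threshold the cleansed comparison would be reported and the raw one would not. The birth date filter has its own failure mode: two records for the same person with different recorded birth dates are never compared.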
In my experience, concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes others: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.
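The GRAHAM OLIVER example can be tried directly against DUAL. This is just one possible way to set up the comparison (taking the best score per token with GREATEST is my assumption, not something the UTL_MATCH package prescribes): the first column scores the concatenated strings, the other two take the best match each token finds among the tokens of the other name.

```sql
select utl_match.jaro_winkler_similarity('GRAHAM OLIVER', 'OLIVER GRAHAM')
           as jw_concatenated
     , greatest( utl_match.jaro_winkler_similarity('GRAHAM', 'OLIVER')
               , utl_match.jaro_winkler_similarity('GRAHAM', 'GRAHAM') )
           as jw_best_for_graham
     , greatest( utl_match.jaro_winkler_similarity('OLIVER', 'OLIVER')
               , utl_match.jaro_winkler_similarity('OLIVER', 'GRAHAM') )
           as jw_best_for_oliver
  from dual;
```

Each token has an exact counterpart on the other side, so both token-wise columns come out at 100, whereas the concatenated score is penalised for the swapped order. The flip side is that token-wise scoring happily matches a forename against a surname, which is where the extra false positives come from.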
Whatever you do, you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty well with MARX vs AMRX).