How to fuzzy match company names in MySQL with PHP for auto-complete?

2021-11-20 00:00:00 matching string fuzzy-search mysql

My users will import, by cut and paste, a large string that contains company names.

I have an existing and growing MySQL database of company names, each with a unique company_id.

I want to be able to parse the string and assign a fuzzy match to each of the user-entered company names.

Right now, even a straight-up string match is slow. **Will Soundex indexing be faster? How can I give the user some options as they are typing?**

For example, someone writes:

Microsoft       -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc.   -> Polycom

I have found the following threads that seem similar to this question, but neither has an accepted answer, and I'm not sure whether their use case applies to mine:

How to find best fuzzy match for a string in a large string database

Matching inexact company names in Java

Solution

You can start with SOUNDEX(); this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

The drawbacks of SOUNDEX() are:

  • its inability to differentiate longer strings. A standard Soundex code keeps only the first letter plus three digits, so longer strings that diverge at the end generate the same SOUNDEX value
  • the fact that the first letter must be the same, or you won't find a match easily. SQL Server has a DIFFERENCE() function to tell you how far apart two SOUNDEX values are, but I think MySQL has nothing of that kind built in.
  • for MySQL, at least according to the docs, SOUNDEX is intended for English-language strings and is not guaranteed to give consistent results for multibyte character sets such as UTF-8

Example:

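SELECT LEFT(SOUNDEX('Microsoft'), 4);
SELECT LEFT(SOUNDEX('Microsift'), 4);
SELECT LEFT(SOUNDEX('Microsift Corporation'), 4);
SELECT LEFT(SOUNDEX('Microsift Subsidiary'), 4);

/* all of these return 'M262'; MySQL's SOUNDEX() returns a code of arbitrary
   length, so LEFT(..., 4) trims it to the standard four characters */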
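
The first two drawbacks can also be checked from the application side. This is a minimal sketch, assuming the application layer is PHP (as in the question title) and using PHP's built-in soundex(), which returns the standard four-character key. The sample names are only illustrations.

```php
<?php
// PHP's soundex() returns a four-character key: first letter plus three digits.
$names = ['Microsoft', 'Microsift Corporation', 'Nzcrosoft'];

foreach ($names as $name) {
    printf("%-22s %s\n", $name, soundex($name));
}

// Same key despite the extra word: long names collapse together.
var_dump(soundex('Microsoft') === soundex('Microsift Corporation')); // bool(true)

// Different first letter, different key: this typo is never matched.
var_dump(soundex('Microsoft') === soundex('Nzcrosoft')); // bool(false)
```

PHP and MySQL do not implement exactly the same Soundex variant, so keys computed in one should not be compared with keys computed in the other.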

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
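
To make the threshold idea concrete, here is a rough sketch using PHP's built-in levenshtein(). The function name fuzzyMatches(), the sample $companies array (company_id => name) and the default threshold of 2 are all assumptions for illustration; in practice the candidates would come from the database.

```php
<?php
/**
 * Return the candidates within $maxDistance edits of $input, closest first.
 * $candidates maps company_id => name (hypothetical sample data).
 */
function fuzzyMatches(string $input, array $candidates, int $maxDistance = 2): array
{
    $needle = strtolower(trim($input));
    $hits = [];
    foreach ($candidates as $id => $name) {
        $distance = levenshtein($needle, strtolower($name));
        if ($distance <= $maxDistance) {
            $hits[$id] = $distance;   // keep only matches under the threshold
        }
    }
    asort($hits);                     // smallest distance first, ids preserved
    return $hits;
}

$companies = [1 => 'Microsoft', 2 => 'Bare Escentuals', 3 => 'Polycom'];

print_r(fuzzyMatches('Nzcrosoft', $companies));        // [1 => 2]
print_r(fuzzyMatches('Bare Essentials', $companies));  // [2 => 2]
```

Note that "Polycom, Inc." would not match "Polycom" with this threshold, since the suffix alone costs six edits; such suffixes would have to be stripped before comparing.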

The main drawback is that you need both strings to calculate the distance between them, so nothing can be pre-computed and the input has to be compared against every candidate row. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
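
One way to combine the two, sketched here under assumptions: use the pre-calculated Soundex key as a cheap first filter, then run the expensive Levenshtein comparison only on the rows that share that key. The in-memory $bySoundex array stands in for a stored, indexed column in the table; suggest(), the sample rows and the company "Microsemi" are hypothetical.

```php
<?php
// Hypothetical rows (company_id => name); in MySQL the Soundex key would be
// a stored, indexed column filled in when the row is inserted.
$companies = [1 => 'Microsoft', 2 => 'Bare Escentuals', 3 => 'Polycom', 4 => 'Microsemi'];

// Step 1 (done once, at insert time): bucket company ids by Soundex key.
$bySoundex = [];
foreach ($companies as $id => $name) {
    $bySoundex[soundex($name)][] = $id;
}

// Step 2 (per lookup): fetch one bucket, then rank only those rows by distance.
function suggest(string $input, array $companies, array $bySoundex): array
{
    $ranked = [];
    foreach ($bySoundex[soundex($input)] ?? [] as $id) {
        $ranked[$id] = levenshtein(strtolower($input), strtolower($companies[$id]));
    }
    asort($ranked);   // closest name first, ids preserved
    return $ranked;
}

print_r(suggest('Microsift', $companies, $bySoundex));
```

For "Microsift" only the two rows sharing its key are compared, and "Microsoft" ranks first. The first-letter limitation still applies: "Nzcrosoft" lands in a different bucket and returns no suggestions.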

In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
