PHP/MySQL small-scale fuzzy search

2021-12-20 00:00:00 search fuzzy-search php mysql

I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database with about 2400 records (records are added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name, and date. I want to be able to search by one of those fields and have tolerance for spelling/character errors. For example, an address of "123 Main Street" should also match "123 Main St", "123 Main St.", "123 Mian St", "123 Man St", "132 Main St", etc., and likewise for name and date.

The main issues I have with answers to other similar questions:

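  • It's not possible to define synonyms for every possible misspelling, never mind doing so for dates and names.
  • Lucene and the like seem awfully heavyweight for such a limited search data set (call it a maximum of 5,000 records, 3 fields per record).
  • Simply using wildcards to cover every possible misspelling doesn't seem logical.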

Any suggestions? I know it isn't going to be possible to do natively with MySQL, but since the data set is so limited, I'd like to keep it relatively simple... perhaps a PHP class that gets all of the records from the DB, uses some sort of comparison algorithm, and returns the IDs of the similar records?

Thanks, Jason

Recommended answer

Razzie's answer (or using Damerau–Levenshtein) ranks a list of candidate matches according to their closeness to the search key. (Take care: if the key is "12 Main St", then "13 Main St" has the same typing distance as "12 Moin St", but you might want to rank it low or even exclude it, as with "11 Main St", "22 Main St", etc.)
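As a rough illustration of that ranking step, here is a minimal sketch using PHP's built-in levenshtein(). The helper names splitAddress() and rankAddresses(), the $maxDistance threshold of 2, and the sample data are all hypothetical. It takes the strict option of excluding candidates whose house number differs; a softer variant could add a penalty to the distance instead. Note that levenshtein() is plain Levenshtein on bytes, so a transposition such as "Mian" for "Main" costs 2 rather than 1.

```php
<?php
// Hypothetical helper: split "12 Main St" into a house number and a lowercased street part.
function splitAddress(string $address): array
{
    if (preg_match('/^\s*(\d+)\s+(.*)$/', $address, $m)) {
        return [$m[1], strtolower(trim($m[2]))];
    }
    return ['', strtolower(trim($address))];
}

// Hypothetical helper: rank candidates (id => address) by edit distance to the key.
// Assumption: the house number must match exactly; only the street part is fuzzy.
function rankAddresses(string $key, array $candidates, int $maxDistance = 2): array
{
    [$keyNumber, $keyStreet] = splitAddress($key);
    $ranked = [];
    foreach ($candidates as $id => $address) {
        [$number, $street] = splitAddress($address);
        if ($number !== $keyNumber) {
            continue; // e.g. drop "13 Main St" when the key is "12 Main St"
        }
        $distance = levenshtein($keyStreet, $street);
        if ($distance <= $maxDistance) {
            $ranked[$id] = $distance;
        }
    }
    asort($ranked); // smallest distance first, ids preserved as keys
    return $ranked;
}

// Made-up sample data standing in for rows fetched from the database.
$candidates = [
    1 => '12 Main St',
    2 => '12 Moin St',
    3 => '13 Main St',
    4 => '12 Elm Ave',
];
$result = rankAddresses('12 Main St', $candidates);
print_r($result);
```

With only a few thousand rows, running this over every record fetched from MySQL is likely fast enough; the candidate-selection question that follows matters more as the table grows.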

But how do you select a list of candidates of a manageable size to rank?

One way is to compute the metaphone value (or values, using double-metaphone) for each word in the strings you're going to search. Save each of these metaphones in another table with the id of the row containing the original string. You can then search these metaphone values quickly with LIKE 'key%' where key is the metaphone of a word from the search text.
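A minimal sketch of that idea follows, using PHP's built-in metaphone() (double-metaphone is not built in and is not shown). To keep it runnable on its own, the lookup table is a plain PHP array; in MySQL it would be a separate table, here assumed to have columns metaphone and record_id. The function names and sample records are hypothetical, and house numbers are dropped by the word split, so they would need a separate comparison.

```php
<?php
// Hypothetical helper: metaphone codes for each alphabetic word in a string.
function metaphoneKeys(string $text): array
{
    $words = preg_split('/[^a-z]+/i', $text, -1, PREG_SPLIT_NO_EMPTY);
    $keys = [];
    foreach ($words as $word) {
        $code = metaphone($word);
        if ($code !== '') {
            $keys[] = $code;
        }
    }
    return array_values(array_unique($keys));
}

// Build rows shaped like the assumed lookup table: (metaphone, record_id).
function buildIndex(array $records): array
{
    $rows = [];
    foreach ($records as $id => $text) {
        foreach (metaphoneKeys($text) as $code) {
            $rows[] = ['metaphone' => $code, 'record_id' => $id];
        }
    }
    return $rows;
}

// Return ids of records sharing a metaphone prefix with any word of the search text.
function findCandidates(array $indexRows, string $search): array
{
    $ids = [];
    foreach (metaphoneKeys($search) as $key) {
        foreach ($indexRows as $row) {
            // Array equivalent of: WHERE metaphone LIKE 'key%'
            if (strncmp($row['metaphone'], $key, strlen($key)) === 0) {
                $ids[$row['record_id']] = true;
            }
        }
    }
    return array_keys($ids);
}

// Made-up sample records standing in for the address column.
$records = [
    1 => '123 Main Street',
    2 => '123 Mian St',
    3 => '45 Elm Avenue',
];
$index = buildIndex($records);
$candidateIds = findCandidates($index, 'Man St');
print_r($candidateIds);
```

Because any matching word makes a record a candidate, common words such as "St" will pull in many rows; the resulting list is meant to be narrowed by the edit-distance ranking described above.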

Check out the suggested answer on this thread. It's quite neat and should work nicely for DBs that aren't huge.
