如何找到相似的结果并按相似度排序?
如何查询按相似度排序的记录?
例如.搜索库存溢出"将返回
Eg. searching for "Stock Overflow" would return
- 堆栈溢出
- SharePoint 溢出
- 数学溢出
- 政治泛滥
- 视觉特效溢出
例如.搜索LO"将返回:
Eg. searching for "LO" would return:
- pabLO 毕加索
- 米开朗琪罗
- 杰克逊波洛克
<小时>
我需要什么帮助:
使用搜索引擎索引 &搜索 MySQL 表,以获得更好的结果
Using a search engine to index & search a MySQL table, for better results
使用 Sphinx 搜索引擎和 PHP
在 PHP 中使用 Lucene 引擎
Using the Lucene engine with PHP
使用全文索引查找相似/包含的字符串
Using full-text indexing, to find similar/containing strings
<小时>
哪些地方做得不好
- Levenshtein distance 非常不稳定.(UDF, 查询)
搜索狗"给了我:
What does not work well
- Levenshtein distance is very erratic. (UDF, Query)
Searching for "dog" gives me:- 狗
- 沼泽
- 以前
- 大
- 回声
LIKE
返回更好的结果,但对于长查询不返回任何内容,尽管确实存在类似的字符串LIKE
returns better results, but returns nothing for long queries although similar strings do exist- 狗
- dogid
- 多加拉尔
- 教条
推荐答案
我发现当您针对另一个完整字符串搜索完整字符串时,Levenshtein 距离可能不错,但是当您在字符串中查找关键字时,此方法不返回(有时)想要的结果.而且,SOUNDEX 功能不适用于英语以外的语言,因此非常有限.您可以使用 LIKE,但它确实适用于基本搜索.您可能想要查看其他搜索方法以了解您想要实现的目标.例如:
I have found out that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method does not return (sometimes) the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than english, so it is quite limited. You could get away with LIKE, but it's really for basic searches. You may want to look into other search methods for what you want to achieve. For example:
您可以使用 Lucene 作为您项目的搜索基础.它以大多数主要的编程语言实现,而且速度非常快且用途广泛.这种方法可能是最好的,因为它不仅搜索子字符串,还搜索字母换位、前缀和后缀(全部组合).但是,您需要保留一个单独的索引(尽管偶尔可以使用 CRON 从独立脚本更新它).
You may use Lucene as search base for your projects. It's implemented in most major programming languages and it'd quite fast and versatile. This method is probably the best, as it not only search for substrings, but also letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using CRON to update it from a independent script once in a while works though).
或者,如果你想要一个 MySQL 解决方案,全文功能非常好,而且肯定比存储过程更快.如果你的表不是 MyISAM,你可以创建一个临时表,然后执行全文搜索:
Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM, you can create a temporary table, then perform your fulltext search :
CREATE TABLE IF NOT EXISTS `tests`.`data_table` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `title` varchar(2000) CHARACTER SET latin1 NOT NULL, `description` text CHARACTER SET latin1 NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
如果您不想打扰创建它,请使用 数据生成器 来生成一些随机数据自己...
Use a data generator to generate some random data if you don't want to bother creating it yourself...
** 注意 ** :列类型应为
latin1_bin
以执行区分大小写的搜索,而不是使用latin1
不区分大小写.对于 unicode 字符串,我推荐utf8_bin
用于区分大小写,utf8_general_ci
用于不区分大小写的搜索.** NOTE ** : the column type should be
latin1_bin
to perform a case sensitive search instead of case insensitive withlatin1
. For unicode strings, I would recommendutf8_bin
for case sensitive andutf8_general_ci
for case insensitive searches.DROP TABLE IF EXISTS `tests`.`data_table_temp`; CREATE TEMPORARY TABLE `tests`.`data_table_temp` SELECT * FROM `tests`.`data_table`; ALTER TABLE `tests`.`data_table_temp` ENGINE = MYISAM; ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` ( `title` , `description` ); SELECT *, MATCH (`title`,`description`) AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score` FROM `tests`.`data_table_temp` WHERE MATCH (`title`,`description`) AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) ORDER BY `score` DESC; DROP TABLE `tests`.`data_table_temp`;
从 MySQL API 参考页面一个>
这样做的缺点是它不会寻找字母换位或相似,听起来像"的词.
The downside to this is that it will not look for letter transposition or "similar, sounds like" words.
** 更新 **
使用 Lucene 进行搜索,您只需要创建一个 cron 作业(所有 Web 主机都具有此功能"),该作业将简单地执行一个 PHP 脚本(ig "cd/path/to/script; php searchindexer.php") 将更新索引.原因是索引数千个文档"(行、数据等)可能需要几秒钟甚至几分钟的时间,但这是为了确保尽可能快地执行所有搜索.因此,您可能希望创建一个由服务器运行的延迟作业.可能是一夜之间,也可能是下一小时,这取决于您.PHP 脚本应如下所示:
Using Lucene for your search, you will simply need to create a cron job (all web hosts have this "feature") where this job will simply execute a PHP script (i.g. "cd /path/to/script; php searchindexer.php") that will update the indexes. The reason being that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, but this is to ensure that all searches are performed as fast as possible. Therefore, you may want to create a delay job to be run by the server. It may be overnight, or in the next hour, this is up to you. The PHP script should look something like this:
$indexer = Zend_Search_Lucene::create('/path/to/lucene/data'); Zend_Search_Lucene_Analysis_Analyzer::setDefault( // change this option for your need new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive() ); $rowSet = getDataRowSet(); // perform your SQL query to fetch whatever you need to index foreach ($rowSet as $row) { $doc = new Zend_Search_Lucene_Document(); $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8')) ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8')) ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable)) ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8')) ; $indexer->addDocument($doc); } // ... you can get as many $rowSet as you want and create as many documents // as you wish... each document doesn't necessarily need the same fields... // Lucene is pretty flexible on this $indexer->optimize(); // do this every time you add more data to you indexer... $indexer->commit(); // finalize the process
然后,这基本上就是你搜索的方式(基本搜索):
Then, this is basically how you search (basic search) :
$index = Zend_Search_Lucene::open('/path/to/lucene/data'); // same search options Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive() ); Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8'); $query = 'php +field1:foo'; // search for the word 'php' in any field, // +search for 'foo' in field 'field1' $hits = $index->find($query); $numHits = count($hits); foreach ($hits as $hit) { $score = $hit->score; // the hit weight $field1 = $hit->field1; // etc. }
这里是关于 Java 中的 Lucene 的好网站,PHP 和 .Net.
Here are great sites about Lucene in Java, PHP, and .Net.
总结每种搜索方法都有其优缺点:
In conclusion each search methods have their own pros and cons :
- 你提到了 Sphinx 搜索,看起来很不错,只要你能让守护进程在你的网络主机.
- Zend Lucene 需要一个 cron 作业来重新索引数据库.虽然它对用户非常透明,但这意味着任何新数据(或删除的数据!)并不总是与数据库中的数据同步,因此不会立即显示在用户搜索中.
- MySQL FULLTEXT 搜索既好又快,但不会为您提供前两者的所有功能和灵活性.
- You mentioned Sphinx search and it looks very good, as long as you can make the deamon run on your web host.
- Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database and therefore won't show up right away on user search.
- MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.
如果我忘记/遗漏了什么,请随时发表评论.
Please feel free to comment if I have forgotten/missed anything.
- Levenshtein distance is very erratic. (UDF, Query)
相关文章