SQL 搜索的关键字

用例

当用户访问我的网站时,他们会遇到一个类似于 SO 的搜索框.他们可以使用计划文本搜索结果..net 问题"、封闭式问题"、.net 和 java"等.搜索的功能与 SO 略有不同,因为它将尽可能多地尝试数据库的架构,而不是直接全文搜索.所以.net 问题"只会搜索 .net 问题而不是 .net 答案(可能不适用于 SO 案例,这里只是一个例子),封闭问题"将返回封闭的问题,.net 和 java"问题将返回与 .net 和 java 相关的问题,仅此而已.

When a user goes to my website, they will be confronted with a search box much like SO. They can search for results using plan text. ".net questions", "closed questions", ".net and java", etc.. The search will function a bit different that SO, in that it will try to as much as possible of the schema of the database rather than a straight fulltext search. So ".net questions" will only search for .net questions as opposed to .net answers (probably not applicable to SO case, just an example here), "closed questions" will return questions that are closed, ".net and java" questions will return questions that relate to .net and java and nothing else.

问题

我对这些词不太熟悉,但我基本上想为 SQL 驱动的搜索做一个关键字.我知道数据库的架构,我也可以对数据库进行数据挖掘.我想知道在我尝试实施之前已经存在的任何当前方法.我想这个问题是针对所述问题的良好设计.

I'm not too familiar with the words but I basically want to do a keyword to SQL driven search. I know the schema of the database and I also can datamine the database. I want to know any current approaches there that existing out already before I try to implement this. I guess this question is for what is a good design for the stated problem.

建议

到目前为止我提出的解决方案看起来像这样

My proposed solution so far looks something like this

  1. 清理输入.只需删除任何特殊字符
  2. 将输入解析为数据块.将c# java"的输入拆分成c#和java 还处理特殊情况,比如'c# java' questions"变成'c# java'和questions".
  3. 根据输入构建一棵树
  4. 将数据绑定到元数据中.因此,转换诸如封闭式问题之类的内容,并将其与表格的 isclosed 列相关联.
  5. 将树转换为 sql 查询.

想法/建议/链接?

推荐答案

我经营一家数字音乐商店,采用单一搜索",根据关键字的出现次数和产品出现的模式对关键字进行加权,例如.具有不同的列,如艺术家"、标题"或出版商".

I run a digital music store with a "single search" that weights keywords based on their occurrences and the schema in which Products appear, eg. with different columns like "Artist", "Title" or "Publisher".

产品也与专辑和播放列表有关,但为了更简单的说明,我只详细说明产品关键字的索引和查询.

Products are also related to albums and playlists, but for simpler explanation, I will only elaborate on the indexing and querying of Products' Keywords.

Keywords 表 - 每个可能被搜索的词的加权表(因此,它在某处被引用),每个记录的数据如下:

Keywords table - a weighted table for every word that could possibly be searched for (hence, it is referenced somewhere) with the following data for each record:

  • 关键字 ID(不是单词),
  • 圣言本身,
  • 单词的 Soundex Alpha 值
  • 重量

ProductKeywords 表 - 任何产品字段(或列)引用的每个关键字的加权表,每条记录包含以下数据:

ProductKeywords table - a weighted table for every keyword referenced by any of a product's fields (or columns) with the following data for each record:

  • 产品 ID,
  • 关键字 ID,
  • 重量

权重值表示单词出现的频率.匹配权重较低的关键字更独特",更有可能成为搜索对象.这样,经常出现的词会自动降低权重",例如.the"、a"或I".但是,最好在索引之前去除那些常见词的原子出现.

The weighting value is an indication of how often the words occurs. Matching keywords with a lower weight are "more unique" and are more likely to be what is being searched for. In this way, words occurring often are automatically "down-weighted", eg. "the", "a" or "I". However, it is best to strip out atomic occurrences of those common words before indexing.

我使用整数进行加权,但使用十进制值会提供更多功能,排序可能会稍微慢一些.

I used integers for weighting, but using a decimal value will offer more versatility, possibly with slightly slower sorting.

每当更新任何产品字段时,例如.艺术家或标题(这种情况不经常发生),数据库触发器在交易中重新索引产品的关键字:

Whenever any product field is updated, eg. Artist or Title (which does not happen that often), a database trigger re-indexes the product's keywords like so inside a transaction:

  1. 如果不再引用,所有产品关键字都将被取消关联并删除.
  2. 每个索引字段(例如艺术家)值都作为关键字整体存储/检索,并与 ProductKeywords 表中的产品相关以进行直接匹配.
  3. 然后关键字权重会增加一个取决于字段重要性的值.您可以根据字段的重要性添加、减去权重.如果艺术家比标题更重要,则从其 ProductKeyword 权重调整中减去 1 或 2.
  4. 每个索引字段值都被去除了任何非字母数字字符并拆分为单独的词组,例如.比利乔尔"变成了比利"和乔尔".
  5. 每个字段值的每个单独的词组都被解析和存储/检索为关键字,并以与步骤 2 中相同的方式与产品相关联.如果关键字已经与产品相关联,则只需调整其权重.
  1. All product keywords are disassociated and deleted if no longer referenced.
  2. Each indexed field (eg. Artist) value is stored/retrieved as a keyword in its entirety and related to the product in the ProductKeywords table for a direct match.
  3. The keyword weight is then incremented by a value that depends on the importance of the field. You can add, subtract weight based on the importance of the field. If Artist is more important than Title, Subtract 1 or 2 from its ProductKeyword weight adjustment.
  4. Each indexed field value is stripped of any non-alphanumeric characters and split into separate word groups, eg. "Billy Joel" becomes "Billy" and "Joel".
  5. Each separate word group for each field value is soundexed and stored/retrieved as a keyword and associated with the product in the same way as in step 2. If a keyword has already been associated with a product, its weight is simply adjusted.

查询

  1. 获取完整的输入查询搜索字符串并查找直接匹配的关键字.在内存表中检索关键字的所有 ProductKeywords 以及关键字权重(与 ProductKeyword 权重不同).
  2. 去除所有非字母数字字符并将查询拆分为关键字.检索所有现有关键字(只有少数匹配).将 ProductKeywords 与与 ProductKeyword 权重不同的 Keyword 权重与内存表中的关键字相匹配.
  3. 重复第 2 步,但使用 soundex 值,调整权重以降低相关性.
  4. 将检索到的 ProductKeywords 加入到它们的相关产品中,并检索每个产品的销售额,这是受欢迎程度的衡量标准.
  5. 按关键字权重、产品关键字权重和销售额对结果进行排序.最终的总和/排序和/或加权取决于您的实施.
  6. 限制结果并将产品搜索结果返回给客户.

相关文章