Building a tag cloud with Solr

Dear Stack Overflow community,

Given some text, I want to get the 50 most frequent words in it and build a tag cloud from them, so that the gist of the text is shown graphically.

The text is actually a set of 100 or so comments per item (a picture), and there are about 120 items. I also want to keep the cloud up to date by keeping the comments indexed and running the cloud generation code each time a new web request comes in.

I settled on using Solr to index the text, and I am now wondering how to get the top 50 words out of Solr's TermVectorComponent. Here is an example of the results returned by the term vector component after you turn on term frequency with tv.tf=true:

  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>    
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>

  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>

As you can see, I have two problems:

  1. I get all the terms in the document for that field, not just the top 100.
  2. They are not sorted by frequency, so I have to fetch the terms and sort them in memory to do what I am trying to do.

Is there a better way? Can I tell the Solr term vector component to sort the terms and return only the top 100 for me? Or is there some other framework I can use? I need to index new comments as they arrive, so that the tag cloud is always up to date. As for the cloud generator, it takes a dictionary of weighted words and turns it into a nice image.
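For reference, the in-memory fallback described in problem 2 could look roughly like the sketch below. It assumes the abbreviated response layout shown above, wrapped in a single root element; a real term vector response may nest the tf value differently, in which case the lookup would need adjusting. The function name top_terms and the wrapper element are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample mirrors the abbreviated response above, wrapped in one root element
# (the wrapper name is an assumption for this sketch).
SAMPLE = """
<lst name="termVectors">
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="earbud"><tf>3</tf></lst>
      <lst name="headphon"><tf>10</tf></lst>
      <lst name="usb"><tf>11</tf></lst>
    </lst>
  </lst>
  <lst name="doc-9">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cabl"><tf>5</tf></lst>
      <lst name="usb"><tf>4</tf></lst>
    </lst>
  </lst>
</lst>
"""


def top_terms(xml_text, field, n):
    """Sum per-document term frequencies for `field` and return the n largest."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    # Each document is a <lst> under the root; the field is a nested <lst>.
    for field_node in root.iterfind(f"./lst/lst[@name='{field}']"):
        for term_node in field_node.findall("lst"):
            tf = term_node.find("tf")
            if tf is not None:
                totals[term_node.get("name")] += int(tf.text)
    return totals.most_common(n)


top = top_terms(SAMPLE, "includes", 3)
print(top)
```

This works, but it pulls every term of every matching document over the wire before discarding most of them, which is exactly the cost the question is trying to avoid.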

This answer does not help.

EDIT: trying out jpountz's and Paige Cook's answers

Here is the result I got for this query:

    select?q=Id:d4439543-afd4-42fb-978a-b72eab0c07f9&facet=true&facet.field=Post_Content&facet.minCount=1&facet.limit=50

<int name="also">1</int>
<int name="ani">1</int>
<int name="anoth">1</int>
<int name="atleast">1</int>
<int name="base">1</int>
<int name="bcd">1</int>
<int name="becaus">1</int>
<int name="better">1</int>
<int name="bigger">1</int>
<int name="bio">1</int>
<int name="boot">1</int>
<int name="bootabl">1</int>
<int name="bootload">1</int>
<int name="bootscreen">1</int>

I got 50 such elements. @jpountz, thanks for helping me limit the results, but why do all fifty of the individual <int> elements hold the value 1? My guess is that the number 1 is the count of documents matching my query (which can only be one, since I queried by Id:Guid), and that it does not represent the frequency of the words in Post_Content.

To check this, I removed the Id:GUID clause from the query, and the result was:

<int name="content">33</int>
<int name="can">17</int>
<int name="on">16</int>
<int name="so">16</int>
<int name="some">16</int>
<int name="all">15</int>
<int name="i">15</int>
<int name="do">14</int>
<int name="have">14</int>
<int name="my">14</int>

My problem is how to get the term frequency within a document, not the document frequency of each term. For example, I know for a fact that I used the word bootable 6 times in Post_Content, so I want sorted pairs like (6, "bootable"), (5, "disc") for a set of documents.
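To make the distinction concrete, here is a small sketch on a made-up three-post corpus (the posts and the whitespace tokenizer are illustrative assumptions, not Solr's analysis chain). Term frequency counts every occurrence of a word; document frequency counts each post at most once per word, which is what the facet counts above behave like.

```python
from collections import Counter

# Toy corpus: three "posts" (made up for illustration).
posts = [
    "bootable disc bootable usb",
    "bootable usb disc disc",
    "usb cable",
]


def term_frequency(docs):
    """Count every occurrence of each word across all docs."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts


def document_frequency(docs):
    """Count the number of docs that contain each word at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))
    return counts


tf = term_frequency(posts)
df = document_frequency(posts)
print("tf:", tf.most_common())
print("df:", df.most_common())
```

In this corpus "bootable" occurs three times but in only two posts, so the two measures rank words differently; a tag cloud for a set of comments wants the first one.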

Solution

I have come up with a stopgap solution. (For the sake of the examples, I am calling each Solr document a "post".)

Solr has a terms component whose purpose seems to be to expose all the indexed terms of any given field. It is mainly used to implement features like auto-complete and other features that operate at the term level. By default its output is sorted by frequency: the more frequently occurring terms in the field come first.

What I did was create a dynamic field with the prefix content_ and index each post-set into its own field, based on category. This means there will be hundreds of instances of the dynamic field, each containing one post-set, and I can use the terms component on a given field to get the top terms for that post-set.

As a picture:

content_postSetOne : contains indexed version of a set of posts
content_postSetTwo : contains indexed version of another set of posts
content_postSetThree : contains indexed version of a third set of posts
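Querying the terms component for one of these fields could be sketched as follows. The base URL, the core name "posts", and the presence of a /terms request handler are assumptions about the deployment; the parameters terms.fl, terms.limit, and terms.sort are the terms component's own. The sample response is made up and assumes the JSON layout in which terms and counts alternate in one flat list, so no network call is needed to run the sketch.

```python
import json
from urllib.parse import urlencode


def terms_url(base_url, field, limit=50):
    """Build a terms-component request for the top `limit` terms of `field`."""
    params = {
        "terms": "true",
        "terms.fl": field,        # field whose indexed terms are listed
        "terms.limit": limit,     # how many terms to return
        "terms.sort": "count",    # highest count first
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


def parse_terms(response_text, field):
    """Turn the flat [term, count, term, count, ...] list into pairs."""
    flat = json.loads(response_text)["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Made-up response body, used here instead of a live Solr call.
SAMPLE_RESPONSE = (
    '{"terms": {"content_postSetOne": ["bootabl", 6, "disc", 5, "usb", 2]}}'
)

# Hypothetical host and core name.
url = terms_url("http://localhost:8983/solr/posts", "content_postSetOne")
pairs = parse_terms(SAMPLE_RESPONSE, "content_postSetOne")
weights = dict(pairs)  # word -> weight, the shape the cloud generator expects
print(url)
print(weights)
```

The resulting dictionary can be handed to the cloud generator as is, and only the requested number of terms crosses the wire.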

This solution is sort of working for me, and you can just as easily create a field per post if needed. I am also interested in the implications of using dynamic fields like this: will it be a problem?

How this differs from Paige's and jpountz's answers:

  1. The term frequency is the count of words in one document or in a set of documents, not the number of documents containing the term.
  2. I can get the top-occurring terms from within one document and, if needed, from a set of documents.
  3. I did not use faceting, because it primarily gives the frequency as a number of documents, not as the number of times the word occurred irrespective of which document.
