Stemming algorithm that produces real words

2022-01-02 00:00:00 nlp php stemming snowball porter-stemmer

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities

I've used an implementation of the Porter Stemmer algorithm (I'm writing in PHP, by the way):

http://tartarus.org/~martin/PorterStemmer/php.txt

This works, up to a point, but doesn't return "real" words. The example above is stemmed to "commun".
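
For reference, here is a minimal sketch of how that implementation is typically invoked. The `PorterStemmer` class name and static `Stem()` method match the commonly circulated version of the linked file, but that is an assumption; check the copy you actually download:

```php
<?php
// Hypothetical usage of the Porter stemmer from the linked php.txt,
// assuming it defines a PorterStemmer class with a static Stem() method.
require_once 'PorterStemmer.php'; // a saved copy of the linked php.txt

echo PorterStemmer::Stem('communities'); // "commun", per the example above
echo PHP_EOL;
echo PorterStemmer::Stem('community');   // also "commun"
```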

I've tried "Snowball" (suggested within another Stack Overflow thread).

http://snowball.tartarus.org/demo.php

For my example (community / communities), Snowball stems to "communiti".

Question

Are there any other stemming algorithms that will do this? Has anyone else solved this problem?

My current thinking is that I could use the stemming algorithm to avoid duplicates and then pick the shortest word I encounter as the actual word to display.
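
A minimal sketch of that fallback idea, grouping words by stem and keeping the shortest surface form (assuming the same hypothetical `PorterStemmer::Stem()` interface as above):

```php
<?php
// Sketch of the "shortest word wins" idea: group words by stem,
// then display the shortest surface form seen for each stem.
// Assumes the hypothetical PorterStemmer::Stem() interface from above.
require_once 'PorterStemmer.php';

$words   = ['community', 'communities', 'tags', 'tag'];
$display = []; // stem => shortest word seen so far

foreach ($words as $word) {
    $stem = PorterStemmer::Stem($word);
    if (!isset($display[$stem]) || strlen($word) < strlen($display[$stem])) {
        $display[$stem] = $word;
    }
}

print_r($display); // e.g. ['commun' => 'community', 'tag' => 'tag']
```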

Recommended answer

The core issue here is that stemming algorithms operate on a phonetic basis, purely following the language's spelling rules, with no actual understanding of the language they're working with. To produce real words, you'll probably have to merge the stemmer's output with some form of lookup function to convert the stems back to real words. I can see two basic ways to do this:

  1. Find or create a large dictionary which maps each possible stem back to an actual word. (e.g., communiti -> community)
  2. Create a function which compares each stem to the list of words that were reduced to that stem and attempts to determine which is most similar. (e.g., comparing "communiti" to "community" and "communities" so that "community" would be recognized as the more similar option; a sketch of this follows below)
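
As a sketch of option #2, PHP's built-in levenshtein() can serve as the similarity measure; edit distance is an illustrative choice here, not something the answer prescribes:

```php
<?php
// Sketch of option #2: given a stem and the words that reduced to it,
// pick the candidate closest to the stem by edit distance.
// levenshtein() is a PHP built-in; using it as the similarity measure
// is an illustrative assumption, not part of the original answer.
function closestWord(string $stem, array $candidates): string
{
    usort($candidates, fn ($a, $b) =>
        levenshtein($stem, $a) <=> levenshtein($stem, $b));
    return $candidates[0];
}

echo closestWord('communiti', ['community', 'communities']); // "community"
```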

Personally, I think the way I would do it would be a dynamic form of #1: build up a custom dictionary database by recording every word examined along with what it stemmed to, then assume that the most common word for each stem is the one that should be used. (e.g., if my body of source text uses "communities" more often than "community", then map communiti -> communities.) A dictionary-based approach will be more accurate in general, and building it from the stemmer's input will give results customized to your texts; the primary drawback is the space required, which is generally not an issue these days.
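
A minimal sketch of that dynamic dictionary, again assuming the hypothetical `PorterStemmer::Stem()` interface; a real version would persist the counts in a database rather than an in-memory array:

```php
<?php
// Sketch of the dynamic form of #1: count how often each surface word
// occurs per stem, then map each stem to its most frequent word.
// Assumes the hypothetical PorterStemmer::Stem() interface from above;
// a production version would persist $counts in a database.
require_once 'PorterStemmer.php';

$counts = []; // stem => [word => frequency]

function recordWord(array &$counts, string $word): void
{
    $stem = PorterStemmer::Stem($word);
    $counts[$stem][$word] = ($counts[$stem][$word] ?? 0) + 1;
}

function realWordFor(array $counts, string $stem): ?string
{
    if (empty($counts[$stem])) {
        return null;
    }
    arsort($counts[$stem]); // most frequent surface form first
    return array_key_first($counts[$stem]);
}

foreach (['communities', 'communities', 'community'] as $w) {
    recordWord($counts, $w);
}

echo realWordFor($counts, PorterStemmer::Stem('community')); // "communities"
```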
