Text mining with PHP

2022-01-02 00:00:00 nlp nltk weka php data-mining

I'm doing a project for a college class I'm taking.

I'm using PHP to build a simple web app that classifies tweets as "positive" (or happy) or "negative" (or sad) based on a set of dictionaries. The algorithms I'm considering right now are a Naive Bayes classifier and a decision tree.

However, I can't find any PHP library that helps me do some serious language processing. Python has NLTK (http://www.nltk.org). Is there anything like that for PHP?

I'm planning to use WEKA as the back end of the web app (by calling Weka from the command line within PHP), but that doesn't seem very efficient.

Do you have any idea what I should use for this project? Or should I just switch to Python?

Thanks

Recommended answer

If you're going to be using a Naive Bayes classifier, you don't really need much NL processing. All you need is an algorithm to stem the words in the tweets and, if you want, to remove stop words.
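To give a sense of how little machinery the classifier itself needs, here is a minimal sketch of a multinomial Naive Bayes classifier in plain PHP. It is only an illustration under some assumptions: the class name `NaiveBayes` and its methods are hypothetical, tweets are assumed to arrive already split into tokens, and add-one (Laplace) smoothing is my choice for handling unseen words, not something the question specifies.

```php
<?php
// Minimal multinomial Naive Bayes sketch. The class and method names are
// hypothetical; this is not taken from any existing PHP library.
final class NaiveBayes
{
    private array $wordCounts = []; // label => [word => occurrences]
    private array $totalWords = []; // label => total tokens seen for that label
    private array $docCounts = [];  // label => number of training tweets
    private array $vocabulary = []; // word => true, used as a set

    /** Record one already-tokenized training tweet under the given label. */
    public function train(array $tokens, string $label): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        $this->totalWords[$label] = $this->totalWords[$label] ?? 0;
        foreach ($tokens as $token) {
            $this->wordCounts[$label][$token] = ($this->wordCounts[$label][$token] ?? 0) + 1;
            $this->totalWords[$label]++;
            $this->vocabulary[$token] = true;
        }
    }

    /** Return the most probable label, or '' if nothing has been trained. */
    public function classify(array $tokens): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $bestLabel = '';
        $bestScore = -INF;
        foreach ($this->docCounts as $label => $docCount) {
            // Work in log space so many small probabilities do not underflow.
            $score = log($docCount / $totalDocs);
            foreach ($tokens as $token) {
                $count = $this->wordCounts[$label][$token] ?? 0;
                // Add-one smoothing: an unseen word never zeroes out the score.
                $score += log(($count + 1) / ($this->totalWords[$label] + $vocabSize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestLabel = $label;
            }
        }
        return $bestLabel;
    }
}
```

Training is one `train()` call per labelled tweet, for example `$nb->train(['happy', 'day'], 'positive')`, after which `$nb->classify(['happy', 'love'])` returns the label with the highest score. The sketch assumes the training data contains at least one token overall; a real application would also want to persist the counts between requests.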

Stemming algorithms abound and aren't difficult to code. Removing stop words is just a matter of searching a hash map or something similar. I don't see a justification for switching your development platform to accommodate NLTK, although it is a very nice tool.
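As a rough illustration of both steps, the sketch below tokenizes a tweet, drops stop words via a hash lookup, and strips a few common suffixes. Everything here is an assumption made for the example: the function names `stopWordSet`, `crudeStem`, and `preprocess` are hypothetical, the stop word list is a tiny sample, and the suffix stripper is a deliberately crude stand-in for a real stemming algorithm such as Porter's.

```php
<?php
// Hypothetical helper: build a stop word "set". PHP arrays are hash maps,
// so storing the words as keys makes each isset() check a hash lookup.
function stopWordSet(): array
{
    return array_flip(['i', 'am', 'the', 'a', 'an', 'is', 'and', 'to', 'of']);
}

// Crude suffix stripper used as a placeholder for a real stemmer.
// It removes the first matching suffix, but only if at least three
// characters of the word would remain.
function crudeStem(string $word): string
{
    foreach (['ing', 'ed', 'ly', 'es', 's'] as $suffix) {
        $len = strlen($suffix);
        if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

// Lowercase the tweet, split on anything that is not a letter, digit, or
// apostrophe, drop stop words, and stem what is left.
function preprocess(string $tweet, array $stopWords): array
{
    $words = preg_split('/[^a-z0-9\']+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
    $tokens = [];
    foreach ($words as $word) {
        if (isset($stopWords[$word])) {
            continue; // stop word: skip it
        }
        $tokens[] = crudeStem($word);
    }
    return $tokens;
}
```

For example, `preprocess("I am loving the sunny days!", stopWordSet())` yields `['lov', 'sunny', 'day']`. The stems do not need to be real words; they only need to map related word forms to the same token before the counts are taken.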
