Using Stanford Tregex in Python

2022-01-18 00:00:00 python subprocess parsing stanford-nlp pattern-matching

Problem description

I'm a newbie in NLP and Python. I'm trying to extract a subset of noun phrases from parsed trees from StanfordCoreNLP by using the Tregex tool and the Python subprocess library. In particular, I'm trying to find and extract noun phrases that match the following pattern: '(NP[$VP]>S)|(NP[$VP]>S )|(NP [$VP]>S)|(NP [$VP]>S )' in the Tregex grammar.

For example, below is the original text, saved in a string named "text":

text = ('Pusheen and Smitha walked along the beach. "I want to surf", said Smitha, the CEO of Tesla. However, she fell off the surfboard')

After running the StanfordCoreNLP parser using the Python wrapper, I got the following 3 trees for the 3 sentences:

output1['sentences'][0]['parse']
Out[58]: '(ROOT (S (NP (NNP Pusheen) (CC and) (NNP Smitha)) (VP (VBD walked) (PP (IN along) (NP (DT the) (NN beach)))) (. .)))'

output1['sentences'][1]['parse']
Out[59]: "(ROOT (SINV (`` ``) (S (NP (PRP I)) (VP (VBP want) (PP (TO to) (NP (NN surf) ('' ''))))) (, ,) (VP (VBD said)) (NP (NP (NNP Smitha)) (, ,) (NP (NP (DT the) (NNP CEO)) (PP (IN of) (NP (NNP Tesla))))) (. .)))"

output1['sentences'][2]['parse']
Out[60]: '(ROOT (S (ADVP (RB However)) (, ,) (NP (PRP she)) (VP (VBD fell) (PRT (RP off)) (NP (DT the) (NN surfboard)))))'

I would like to extract the following 3 noun phrases (one for each sentence) and save them as variables (or lists of tokens) in Python:

(NP (NNP Pusheen) (CC and) (NNP Smitha))

(NP (PRP I))

(NP (PRP she))

For your information, I have used Tregex from the command line with the following code:

cd stanford-tregex-2016-10-31
java -cp 'stanford-tregex.jar:' edu.stanford.nlp.trees.tregex.TregexPattern -f -s '(NP[$VP]>S)|(NP[$VP]>S )|(NP [$VP]>S)|(NP [$VP]>S )' /Users/AS/stanford-tregex-2016-10-31/exampletree.txt

The output is:

Pattern string:
(NP[$VP]>S)|(NP[$VP]>S )|(NP [$VP]>S)|(NP [$VP]>S )
Parsed representation:
or
Root NP
and
$ VP
> S
Root NP
and
$ VP
> S
Root NP
and
$ VP
> S
Root NP
and
$ VP
> S
Reading trees from file(s) file path
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP (NNP Pusheen) (CC and) (NNP Smitha))
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP (NP (NNP Smitha)) (, ,) (NP (NP (DT the) (NN spokesperson)) (PP (IN of) (NP (DT the) (NNP CIA)))) (, ,))
# /Users/AS/stanford-tregex-2016-10-31/exampletree.txt
(NP (PRP They))
There were 3 matches in total.

How can I replicate this result in Python?

For your reference, I found the following post via Google, which is relevant to my question but outdated (https://mailman.stanford.edu/pipermail/parser-user/2010-July/000606.html):

[parser-user] Variable input to Tregex

Christopher Manning manning at stanford.edu
Wed Jul 7 17:41:32 PDT 2010

Hi Haiyang,

Sorry, slow reply, things are too busy at the end of the academic year.

On Jun 1, 2010, at 8:56 PM, Haiyang AI wrote:

Dear all,

I hope this is the right place to seek help.

It is, though we can only give very limited help on anything Python specific.....

But this seems to be straightforward (I think).

If what you're wanting is for the pattern to be run on trees being fed in over stdin, you need to add the flag "-filter" in the argument list prior to "NP".

If no file is specified after the pattern, and the flag "-filter" is not given, then it runs the pattern on a fixed default sentence....

Chris.

I'm working on a project related to Tregex. I'm trying to call Tregex from python, but I don't know how to feed data into Tregex, not from conventional file, but from a variable. For example, I'm trying to count the number of "NP" from a given variable (e.g. text, already parsed tree, using Stanford Parser), with the following code,

from subprocess import Popen, PIPE, STDOUT

def tregex(text):
    tregex_dir = "/root/nlp/stanford-tregex-2009-08-30/"
    # No "-filter" flag here, so TregexPattern ignores stdin and uses its default tree.
    op = Popen(["java", "-mx900m", "-cp", "stanford-tregex.jar:",
                "edu.stanford.nlp.trees.tregex.TregexPattern", "NP"],
               cwd=tregex_dir, stdout=PIPE, stdin=PIPE, stderr=STDOUT)
    res = op.communicate(input=text)[0]
    return res

The results are like the following. It didn't search the content from the variable, but somehow falling back to "using default tree". Can anyone give me a hand? I have been stuck here for quite a long time. Really appreciate your time and help.

Pattern string:
NP
Parsed representation:
Root NP
using default tree
(NP (NP (DT this) (NN wine)) (CC and) (NP (DT these) (NNS snails)))

(NP (DT this) (NN wine))

(NP (DT these) (NNS snails))

There were 3 matches in total.

-- Haiyang AI, Ph.D. student Department of Applied Linguistics The Pennsylvania State University

parser-user mailing list parser-user at lists.stanford.edu https://mailman.stanford.edu/mailman/listinfo/parser-user
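
The advice in the quoted post (pass "-filter" so that TregexPattern reads trees from stdin) could be sketched in Python 3 roughly as follows. This is only a sketch under assumptions: build_tregex_command and run_tregex are hypothetical helper names, the "-s" flag is borrowed from the command-line call shown earlier, and the code assumes that matches are printed on stdout one per line while diagnostics go to stderr.

```python
import subprocess

def build_tregex_command(pattern, jar="stanford-tregex.jar"):
    """Build the java argument list for TregexPattern reading trees from stdin.

    Hypothetical helper: the jar name and memory setting are taken from the
    examples in this thread and may need adjusting for a given install.
    """
    return [
        "java", "-mx900m",
        "-cp", jar + ":",          # trailing ":" as in the thread (Unix classpath)
        "edu.stanford.nlp.trees.tregex.TregexPattern",
        "-filter",                 # read trees from stdin instead of a file
        "-s",                      # print each match on a single line
        pattern,
    ]

def run_tregex(trees, pattern, tregex_dir):
    """Feed bracketed trees (one string) to Tregex and return the matched subtrees.

    Assumption: match lines start with "(" and appear on stdout; anything
    else on stdout is ignored.
    """
    proc = subprocess.run(
        build_tregex_command(pattern),
        input=trees,
        capture_output=True,
        text=True,
        cwd=tregex_dir,
        check=True,
    )
    return [line.strip() for line in proc.stdout.splitlines()
            if line.startswith("(")]
```

With a local Tregex install, a call such as run_tregex(output1['sentences'][0]['parse'], 'NP $ VP > S', 'stanford-tregex-2016-10-31') would be expected to return the matching noun phrases as a list of strings, though this has not been verified against every Tregex release.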

Solution

Why not use the Stanford CoreNLP server!

1.) Start up the server!

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

2.) Make a python request!

import requests

url = "http://localhost:9000/tregex"
request_params = {"pattern": "(NP[$VP]>S)|(NP[$VP]>S\n)|(NP\n[$VP]>S)|(NP\n[$VP]>S\n)"}
text = "Pusheen and Smitha walked along the beach."
r = requests.post(url, data=text, params=request_params)
# Python 3 print call; the result shown below came from Python 2, hence the u'' prefixes.
print(r.json())

3.) Here are the results!

{u'sentences': [{u'0': {u'namedNodes': [], u'match': u'(NP (NNP Pusheen) (CC and) (NNP Smitha)) '}}]}
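
To get from that response to plain Python variables, something like the following sketch may help. It assumes the response keeps the shape shown above (a "sentences" list whose items map match indices such as "0" to objects with a "match" string); extract_matches and leaf_tokens are hypothetical helper names, not part of CoreNLP or requests.

```python
import re

def extract_matches(response_json):
    """Collect the matched subtrees from a /tregex response as a flat list.

    Assumes each sentence is a dict keyed by match index ("0", "1", ...).
    """
    matches = []
    for sentence in response_json.get("sentences", []):
        # Keep only numeric keys and visit them in numeric order.
        for key in sorted((k for k in sentence if k.isdigit()), key=int):
            matches.append(sentence[key]["match"].strip())
    return matches

def leaf_tokens(tree):
    """Return the words at the leaves of a bracketed tree string."""
    # A leaf looks like "(TAG word)": a tag, one space, a word, no nested parens.
    return re.findall(r"\([^\s()]+ ([^\s()]+)\)", tree)

# Sample shaped like the result above (without the Python 2 u'' prefixes).
sample = {"sentences": [{"0": {"namedNodes": [],
                               "match": "(NP (NNP Pusheen) (CC and) (NNP Smitha)) "}}]}

noun_phrases = extract_matches(sample)
tokens = [leaf_tokens(np) for np in noun_phrases]
print(noun_phrases)
print(tokens)
```

In practice, extract_matches(r.json()) would be called on the response from step 2; the regular expression is a shortcut that only looks at leaf nodes and is not a full tree parser.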
