pyparsing: converting one query format to another

2022-01-15 00:00:00 python lucene pyparsing pubmed

Problem description

I am at a loss. I have been trying to get this to work for days now without getting anywhere, so I figured I'd ask here and see if someone is able to help me!

I am using pyparsing in an attempt to convert one query format into another. This is not a simple transformation; it actually takes some brains :)

The current query is the following:

("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments] 
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title] 
OR breast cancer[Body - All Words] OR breast cancer[Title] 
OR breast cancer[Abstract] OR breast cancer[Journal]) 
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption] 
OR prevention[Section Title] OR prevention[Body - All Words] 
OR prevention[Title] OR prevention[Abstract])

Using pyparsing, I have been able to get the following structure:

[[[['"', 'breast', 'neoplasms', '"'], ['MeSH', 'Terms']], 'or',
[['breast', 'cancer'], ['Acknowledgments']], 'or', [['breast', 'cancer'],
['Figure/Table', 'Caption']], 'or', [['breast', 'cancer'], ['Section', 
'Title']], 'or', [['breast', 'cancer'], ['Body', '-', 'All', 'Words']], 
'or', [['breast', 'cancer'], ['Title']], 'or', [['breast', 'cancer'], 
['Abstract']], 'or', [['breast', 'cancer'], ['Journal']]], 'and', 
[[['prevention'], ['Acknowledgments']], 'or', [['prevention'], 
['Figure/Table', 'Caption']], 'or', [['prevention'], ['Section', 'Title']], 
'or', [['prevention'], ['Body', '-', 'All', 'Words']], 'or', 
[['prevention'], ['Title']], 'or', [['prevention'], ['Abstract']]]]

But now I am stuck. I need to format the above output as a lucene search query. Here is a short example of the transformations required:

"breast neoplasms"[MeSH Terms] --> [['"', 'breast', 'neoplasms', '"'], 
['MeSH', 'Terms']] --> mesh terms: "breast neoplasms"

That is as far as I have gotten. I also need to be able to make use of the special words AND and OR.

So a final query might be: mesh terms: "breast neoplasms" and prevention

Can anyone help me and give me some hints on how to solve this? Any kind of help would be appreciated.

Since I am using pyparsing, I am bound to Python. I have pasted the code below so that you can play around with it and don't have to start from zero!

Thanks a lot for your help!

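from pyparsing import (Forward, Group, Literal, OneOrMore, Suppress, Word,
                       ZeroOrMore, alphanums, oneOf)

def PubMedQueryParser():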
    word = Word(alphanums +".-/&§")
    complex_structure = Group(Literal('"') + OneOrMore(word) + Literal('"')) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    medium_structure = Group(OneOrMore(word)) + Suppress('[') + Group(OneOrMore(word)) + Suppress(']')
    easy_structure = Group(OneOrMore(word))
    parse_structure = complex_structure | medium_structure | easy_structure
    operators = oneOf("and or", caseless=True)
    expr = Forward()
    atom = Group(parse_structure) + ZeroOrMore(operators + expr)
    atom2 = Group(Suppress('(') + atom + Suppress(')')) + ZeroOrMore(operators + expr) | atom
    expr << atom2
    return expr


Solution

Well, you have gotten yourself off to a decent start. But from here, it is easy to get bogged down in the details of parser-tweaking, and you could be in that mode for days. Let's step through your problem, beginning with the original query syntax.

When you start out on a project like this, write a BNF of the syntax you want to parse. It doesn't have to be super rigorous; in fact, here is a start at one based on what I can see from your sample:

word :: Word('a'-'z', 'A'-'Z', '0'-'9', '.-/&§')
field_qualifier :: '[' word+ ']'
search_term :: (word+ | quoted_string) field_qualifier?
and_op :: 'and'
or_op :: 'or'
and_term :: or_term (and_op or_term)*
or_term :: atom (or_op atom)*
atom :: search_term | ('(' and_term ')')

That's pretty close - we have a slight problem with possible ambiguity between word and the and_op and or_op expressions, since 'and' and 'or' both match the definition of a word. We'll need to tighten this up at implementation time, to make sure that "cancer or carcinoma or lymphoma or melanoma" gets read as 4 different search terms separated by 'or's, not just one big term (which I think is what your current parser would do). We also get the benefit of recognizing operator precedence - maybe not strictly necessary, but let's go with it for now.

Converting to pyparsing is simple enough:

LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
word = Word(alphanums + '.-/&')

field_qualifier = LBRACK + OneOrMore(word) + RBRACK
search_term = ((Group(OneOrMore(word)) | quotedString)('search_text') + 
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

To address the ambiguity of 'or' and 'and', we put a negative lookahead at the beginning of word, so that a word can never match one of the operator keywords:

word = ~(and_op | or_op) + Word(alphanums + '.-/&')
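To see what the lookahead changes, here is a small sketch comparing the plain word with the guarded one. The names greedy_word, guarded_word and sample are made up for this illustration; only and_op, or_op and the word definitions come from the grammar above.

```python
from pyparsing import CaselessKeyword, OneOrMore, Word, alphanums

and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')

# Plain word: happily swallows 'or' as just another word
greedy_word = Word(alphanums + '.-/&')
# Guarded word: refuses to match if an operator keyword comes next
guarded_word = ~(and_op | or_op) + Word(alphanums + '.-/&')

sample = "cancer or carcinoma"
greedy = OneOrMore(greedy_word).parseString(sample).asList()
guarded = OneOrMore(guarded_word).parseString(sample).asList()

print(greedy)   # ['cancer', 'or', 'carcinoma']
print(guarded)  # ['cancer']
```

With the guard in place, a run of words stops at the operator, which leaves the 'or' available for or_op to match.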

To give some structure to the results, wrap the expressions in Group classes:

field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = Group(Group(OneOrMore(word) | quotedString)('search_text') +
                          Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = Group(atom + ZeroOrMore(or_op + atom))
and_term = Group(or_term + ZeroOrMore(and_op + or_term))
expr << and_term

Now parsing your sample text with:

res = expr.parseString(test)
from pprint import pprint
pprint(res.asList())

Gives:

[[[[[[['"breast neoplasms"'], ['MeSH', 'Terms']],
     'or',
     [['breast', 'cancer'], ['Acknowledgments']],
     'or',
     [['breast', 'cancer'], ['Figure/Table', 'Caption']],
     'or',
     [['breast', 'cancer'], ['Section', 'Title']],
     'or',
     [['breast', 'cancer'], ['Body', '-', 'All', 'Words']],
     'or',
     [['breast', 'cancer'], ['Title']],
     'or',
     [['breast', 'cancer'], ['Abstract']],
     'or',
     [['breast', 'cancer'], ['Journal']]]]],
  'and',
  [[[[['prevention'], ['Acknowledgments']],
     'or',
     [['prevention'], ['Figure/Table', 'Caption']],
     'or',
     [['prevention'], ['Section', 'Title']],
     'or',
     [['prevention'], ['Body', '-', 'All', 'Words']],
     'or',
     [['prevention'], ['Title']],
     'or',
     [['prevention'], ['Abstract']]]]]]]

Actually, this is pretty similar to the results from your parser. We could now recurse through this structure and build up your new query string, but I prefer to do this using parsed objects, created at parse time by defining classes as token containers instead of Groups, and then adding behavior to the classes to get our desired output. The distinction is that our parsed object token containers can have behavior that is specific to the kind of expression that was parsed.
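For comparison, the recursive alternative could look roughly like the sketch below. It is plain Python working on a shortened version of the nested list; the helper names is_search_term and to_query, and the rule that a single-element wrapper list adds no parentheses, are assumptions made for this illustration.

```python
def is_search_term(node):
    """A search term is a non-empty list whose items are lists of plain strings."""
    return bool(node) and all(
        isinstance(part, list) and part and all(isinstance(w, str) for w in part)
        for part in node)

def to_query(node):
    """Recursively turn the nested list structure into a query string."""
    if isinstance(node, str):
        return node.upper()                      # 'or' / 'and' operators
    if is_search_term(node):
        text = ' '.join(node[0])
        if len(node) > 1:                        # optional field qualifier
            return '%s: %s' % (' '.join(node[1]).lower(), text)
        return text
    parts = [to_query(child) for child in node]
    if len(parts) == 1:                          # wrapper group, no parens needed
        return parts[0]
    return '(%s)' % ' '.join(parts)

term1 = [['"breast neoplasms"'], ['MeSH', 'Terms']]
term2 = [['breast', 'cancer'], ['Title']]
term3 = [['prevention'], ['Abstract']]
parsed = [[[term1, 'or', term2]], 'and', [[term3]]]

query = to_query(parsed)
print(query)
```

This works, but every decision about what a node means has to be re-derived from the list shape, which is the bookkeeping the class-based approach avoids.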

We'll begin with an abstract base class, ParsedObject, that takes the parsed tokens as its initializing structure. We'll also add an abstract method, queryString, which we'll implement in all the deriving classes to create your desired output:

class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

Now we can derive from this class, and any subclass can be used as a parse action when defining the grammar.
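As a minimal sketch of that mechanism on its own: pyparsing calls the class with the matched tokens, and the resulting instance replaces those tokens in the results. The Phrase class and the phrase expression are hypothetical, used only to show the wiring.

```python
from pyparsing import OneOrMore, Word, alphas

class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''
        raise NotImplementedError

# Hypothetical subclass, only for this illustration
class Phrase(ParsedObject):
    def queryString(self):
        return ' '.join(self.tokens).lower()

phrase = OneOrMore(Word(alphas))
# The class itself is the parse action: pyparsing instantiates it with the tokens
phrase.setParseAction(Phrase)

result = phrase.parseString("Breast Cancer")[0]
print(type(result).__name__)  # Phrase
print(result.queryString())   # breast cancer
```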

When we do this, the Groups that were added for structure kind of get in our way, so we'll redefine the original parser without them:

search_term = (Group(OneOrMore(word) | quotedString)('search_text') + 
               Optional(field_qualifier)('field'))
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

Now we implement the class for search_term, using self.tokens to access the parsed bits found in the input string:

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower() 
                                        for f in self.tokens.field[0]),text)
        else:
            return text
search_term.setParseAction(SearchTerm)

Next we'll implement the and_term and or_term expressions. Both are binary operators differing only in the operator string they produce in the output query, so we can define one class and let the subclasses provide a class constant for their respective operator strings:

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)

Note that pyparsing is a little different from traditional parsers - our BinaryOperation will match "a or b or c" as a single expression, not as the nested pairs "(a or b) or c". So we have to rejoin all of the terms using the stepping slice [0::2], which picks out the operands and skips the operator tokens between them.
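The slicing itself is plain Python, so it can be tried without a parser. In this sketch the flat token list is written by hand to stand in for what the parse action receives, and the reduce call is only there to show what the nested-pairs form would look like.

```python
from functools import reduce

# Flat token list for "a or b or c", as a single or_term match would deliver it
tokens = ['a', 'or', 'b', 'or', 'c']

operands = tokens[0::2]    # every second item starting at 0: the terms
operators = tokens[1::2]   # every second item starting at 1: the 'or's

flat = ' OR '.join(operands)
# For contrast, the left-nested pairs a traditional binary parser would produce
nested = reduce(lambda left, right: '(%s OR %s)' % (left, right), operands)

print(flat)    # a OR b OR c
print(nested)  # ((a OR b) OR c)
```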

Finally, we add a parse action to reflect any nesting by wrapping all exprs in ()'s:

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)

For your convenience, here is the entire parser in one copy/pastable block:

from pyparsing import *

LBRACK,RBRACK,LPAREN,RPAREN = map(Suppress,"[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
word = ~(and_op | or_op) + Word(alphanums + '.-/&')
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)

search_term = (Group(OneOrMore(word) | quotedString)('search_text') + 
               Optional(field_qualifier)('field'))
expr = Forward()
atom = search_term | (LPAREN + expr + RPAREN)
or_term = atom + ZeroOrMore(or_op + atom)
and_term = or_term + ZeroOrMore(and_op + or_term)
expr << and_term

# define classes for parsed structure
class ParsedObject(object):
    def __init__(self, tokens):
        self.tokens = tokens
    def queryString(self):
        '''Abstract method to be overridden in subclasses'''

class SearchTerm(ParsedObject):
    def queryString(self):
        text = ' '.join(self.tokens.search_text)
        if self.tokens.field:
            return '%s: %s' % (' '.join(f.lower() 
                                        for f in self.tokens.field[0]),text)
        else:
            return text
search_term.setParseAction(SearchTerm)

class BinaryOperation(ParsedObject):
    def queryString(self):
        joinstr = ' %s ' % self.op
        return joinstr.join(t.queryString() 
                                for t in self.tokens[0::2])
class OrOperation(BinaryOperation):
    op = "OR"
class AndOperation(BinaryOperation):
    op = "AND"
or_term.setParseAction(OrOperation)
and_term.setParseAction(AndOperation)

class Expr(ParsedObject):
    def queryString(self):
        return '(%s)' % self.tokens[0].queryString()
expr.setParseAction(Expr)


test = """("breast neoplasms"[MeSH Terms] OR breast cancer[Acknowledgments]  
OR breast cancer[Figure/Table Caption] OR breast cancer[Section Title]  
OR breast cancer[Body - All Words] OR breast cancer[Title]  
OR breast cancer[Abstract] OR breast cancer[Journal])  
AND (prevention[Acknowledgments] OR prevention[Figure/Table Caption]  
OR prevention[Section Title] OR prevention[Body - All Words]  
OR prevention[Title] OR prevention[Abstract])"""

res = expr.parseString(test)[0]
print(res.queryString())

Prints the following:

((mesh terms: "breast neoplasms" OR acknowledgments: breast cancer OR 
  figure/table caption: breast cancer OR section title: breast cancer OR 
  body - all words: breast cancer OR title: breast cancer OR 
  abstract: breast cancer OR journal: breast cancer) AND 
 (acknowledgments: prevention OR figure/table caption: prevention OR 
  section title: prevention OR body - all words: prevention OR 
  title: prevention OR abstract: prevention))

I'm guessing you'll need to tighten up some of this output - those lucene tag names look very ambiguous - I was just following your posted sample. But you shouldn't have to change the parser much, just adjust the queryString methods of the attached classes.

As an added exercise for the poster: add support for the NOT boolean operator to your query language.
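One possible way to approach the exercise, offered as a sketch rather than the intended solution. To keep it short it uses plain functions returning strings as parse actions instead of the ParsedObject classes, and it assumes NOT should bind tighter than OR and AND. The names not_op, negated, not_term, group, render_term and convert are introduced here and are not part of the parser above.

```python
from pyparsing import (CaselessKeyword, Forward, Group, OneOrMore, Optional,
                       Suppress, Word, ZeroOrMore, alphanums, quotedString)

LBRACK, RBRACK, LPAREN, RPAREN = map(Suppress, "[]()")
and_op = CaselessKeyword('and')
or_op = CaselessKeyword('or')
not_op = CaselessKeyword('not')
# 'not' joins the keywords that a word must never match
word = ~(and_op | or_op | not_op) + Word(alphanums + '.-/&')
field_qualifier = Group(LBRACK + OneOrMore(word) + RBRACK)
search_term = Group(OneOrMore(word) | quotedString) + Optional(field_qualifier)

def render_term(tokens):
    # tokens[0] is the search text group, tokens[1] the optional field group
    text = ' '.join(tokens[0])
    if len(tokens) > 1:
        return '%s: %s' % (' '.join(tokens[1]).lower(), text)
    return text
search_term.setParseAction(render_term)

expr = Forward()
group = LPAREN + expr + RPAREN
group.setParseAction(lambda t: '(%s)' % t[0])
atom = search_term | group

# Assumption: NOT applies to the single atom that follows it
negated = not_op + atom
negated.setParseAction(lambda t: 'NOT %s' % t[1])
not_term = negated | atom

or_term = not_term + ZeroOrMore(or_op + not_term)
or_term.setParseAction(lambda t: ' OR '.join(t[0::2]))
and_term = or_term + ZeroOrMore(and_op + or_term)
and_term.setParseAction(lambda t: ' AND '.join(t[0::2]))
expr <<= and_term

def convert(query):
    return expr.parseString(query, parseAll=True)[0]

print(convert('breast cancer[Title] AND NOT prevention[Abstract]'))
```

Whether the output should use lucene's NOT, a leading minus sign, or something else is a choice for the poster; only the negated parse action would need to change.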
