Overlapping matches with finditer() in Python
Problem description
I'm using a regex to match Bible verse references in a text. The current regex is:
REF_REGEX = re.compile(r'''
    (?<!\w)                          # Not preceded by any words
    (?P<quote>q(?:uote)?\s+)?        # Match optional 'q' or 'quote' followed by many spaces
    (?P<book>
        (?:(?:[1-3]|I{1,3})\s*)?     # Match an optional arabic or roman number between 1 and 3.
        [A-Za-z]+                    # Match any alphabetics
    )\.?                             # Followed by an optional dot
    (?:
        \s*(?P<chapter>\d+)          # Match the chapter number
        (?:
            [:.](?P<startverse>\d+)  # Match the starting verse number, preceded by ':' or '.'
            (?:-(?P<endverse>\d+))?  # Match the optional ending verse number, preceded by '-'
        )?                           # Verse numbers are optional
    )
    (?:
        \s+(?:                       # Here be spaces
            (?:from\s+)|(?:in\s+)|(?P<lbrace>\())  # Match 'from[:space:]', 'in[:space:]' or '('
        \s*(?P<version>\w+)          # Match a word preceded by optional spaces
        (?(lbrace)\))                # Close the '(' if found earlier
    )?                               # The whole 'in|from|()' is optional
    ''', re.IGNORECASE | re.VERBOSE | re.UNICODE)
This matches the following expressions fine:
"jn 3:16": (None, 'jn', '3', '16', None, None, None),
"matt. 18:21-22": (None, 'matt', '18', '21', '22', None, None),
"q matt. 18:21-22": ('q ', 'matt', '18', '21', '22', None, None),
"QuOTe jn 3:16": ('QuOTe ', 'jn', '3', '16', None, None, None),
"q 1co13:1": ('q ', '1co', '13', '1', None, None, None),
"q 1 co 13:1": ('q ', '1 co', '13', '1', None, None, None),
"quote 1 co 13:1": ('quote ', '1 co', '13', '1', None, None, None),
"quote 1co13:1": ('quote ', '1co', '13', '1', None, None, None),
"jean 3:18 (PDV)": (None, 'jean', '3', '18', None, '(', 'PDV'),
"quote malachie 1.1-2 fRom Colombe": ('quote ', 'malachie', '1', '1', '2', None, 'Colombe'),
"quote malachie 1.1-2 In Colombe": ('quote ', 'malachie', '1', '1', '2', None, 'Colombe'),
"cinq jn 3:16 (test)": (None, 'jn', '3', '16', None, '(', 'test'),
"Q IIKings5.13-58 from wolof": ('Q ', 'IIKings', '5', '13', '58', None, 'wolof'),
"This text is about lv5.4-6 in KJV only": (None, 'lv', '5', '4', '6', None, 'KJV'),
but it fails to parse:
"Found in 2 Cor. 5:18-21 ( Ministers": (None, '2 Cor', '5', '18', '21', None, None),
because it returns (None, 'in', '2', None, None, None, None) instead.
Is there a way to get finditer() to return all matches, even if they overlap, or is there a way to improve my regex so it matches this last bit properly?
Thanks.
Solution
A character that has been consumed stays consumed; you should not ask the regex engine to go back and match it again.
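To make that concrete, here is a minimal sketch using a deliberately simplified pattern (SIMPLE_REF is a hypothetical stand-in, not the full regex from the question). Once "in 2" has matched, the "2" is gone, so the scan can no longer start a match at "2 Cor.":

```python
import re

# Hypothetical, stripped-down version of the book + chapter part of the regex.
SIMPLE_REF = re.compile(r"(?:[1-3]\s*)?[A-Za-z]+\.?\s*\d+")

text = "in 2 Cor. 5"

# finditer()/findall() resume scanning where the previous match ended,
# so the "2" consumed by the first match cannot start the second one.
matches = SIMPLE_REF.findall(text)
print(matches)  # ['in 2', 'Cor. 5']
```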
Judging from your examples, the verse part (e.g. ':1') is not actually optional. Removing the '?' that makes it optional lets the regex match the last example:
# re.I replaces the inline (?i) flags, which Python 3.11+ rejects in the middle of a pattern.
ref_regex = re.compile(r'''
    (?<!\w)                          # Not preceded by any words
    (q(?:uote)?\s+)?                 # Match 'q' or 'quote' followed by many spaces
    (
        (?:(?:[1-3]|I{1,3})\s*)?     # Match an arabic or roman number between 1 and 3.
        [A-Za-z]+                    # Match many alphabetics
    )\.?                             # Followed by an optional dot
    (?:
        \s*(\d+)                     # Match the chapter number
        (?:
            [:.](\d+)                # Match the verse number
            (?:-(\d+))?              # Match the ending verse number
        )                            # <-- no '?' here
    )
    (?:
        \s+
        (?:
            (?:from\s+)|             # Match the keyword 'from' or 'in'
            (?:in\s+)|
            (?P<lbrace>\()           # or stuff between (...)
        )\s*(\w+)
        (?(lbrace)\))
    )?
    ''', re.I | re.X | re.U)
(If you're going to write a gigantic regex like this, please use the /x flag, which is re.X or re.VERBOSE in Python.)
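As a quick check of the idea, the sketch below restates the pattern with the question's named groups and a mandatory verse part, then runs it on the failing input. The helper name find_refs is my own and purely illustrative:

```python
import re

# Same structure as the regex above, with named groups; the verse part is mandatory.
REF_REGEX = re.compile(r'''
    (?<!\w)
    (?P<quote>q(?:uote)?\s+)?
    (?P<book>
        (?:(?:[1-3]|I{1,3})\s*)?
        [A-Za-z]+
    )\.?
    \s*(?P<chapter>\d+)
    [:.](?P<startverse>\d+)          # no '?' around the verse part
    (?:-(?P<endverse>\d+))?
    (?:
        \s+(?:from\s+|in\s+|(?P<lbrace>\())
        \s*(?P<version>\w+)
        (?(lbrace)\))
    )?
''', re.IGNORECASE | re.VERBOSE)


def find_refs(text):
    """Return the groups() tuple of every reference found in text."""
    return [m.groups() for m in REF_REGEX.finditer(text)]


print(find_refs("Found in 2 Cor. 5:18-21 ( Ministers"))
# [(None, '2 Cor', '5', '18', '21', None, None)]
```

The attempt starting at "in 2" now fails because no ':' or '.' follows the "2", so nothing is consumed and the scan moves on to "2 Cor.". The trade-off is that a chapter-only reference such as "jn 3" no longer matches.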
If you really need overlapping matches, you could use a lookahead. A simple example is
>>> rx = re.compile('(.)(?=(.))')
>>> x = rx.finditer("abcdefgh")
>>> [y.groups() for y in x]
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f'), ('f', 'g'), ('g', 'h')]
You may extend this idea to your RegEx.
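One way to extend it, sketched here under the assumption that you want every candidate match regardless of overlap, is to wrap the whole pattern in a lookahead with a single capturing group. The helper overlapping() and the simplified reference pattern are hypothetical, not part of the original regex:

```python
import re


def overlapping(pattern, text, flags=0):
    """Return all matches of pattern in text, including overlapping ones.

    The pattern is wrapped in a zero-width lookahead, so nothing is consumed
    and finditer() retries at every position. Any groups inside the pattern
    are shifted by one, because group 1 is the wrapper.
    """
    rx = re.compile(r"(?=(" + pattern + r"))", flags)
    return [m.group(1) for m in rx.finditer(text)]


print(overlapping(r"\d\d", "12345"))
# ['12', '23', '34', '45']

# Simplified stand-in for the book + chapter part of the reference regex.
simple_ref = r"(?<!\w)(?:[1-3]\s*)?[A-Za-z]+\.?\s*\d+"
print(overlapping(simple_ref, "in 2 Cor. 5"))
# ['in 2', '2 Cor. 5', 'Cor. 5']
```

Note that this returns every candidate, including unwanted ones such as 'in 2', so the results still need filtering afterwards.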