Regular expression to detect semi-colon terminated C++ for & while loops

2021-12-02 00:00:00 python recursion regex parsing c++

In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this:

for (int i = 0; i < 10; i++);

...but not this:

for (int i = 0; i < 10; i++)

This looks trivial at first glance, until you realise that the text between the opening and closing parentheses may itself contain other parentheses, for example:

for (int i = funcA(); i < funcB(); i++);

I'm using Python's re module. Right now my regular expression looks like this (I've left my comments in so you can understand it more easily):

# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*

This works perfectly for all the above cases, but it breaks as soon as you try and make the third part of the for loop contain a function, like so:

for (int i = 0; i < 10; doSomethingTo(i));

I think it breaks because as soon as you put some text between the opening and closing parentheses, the "balanced" group captures that text, and the (?P=balanced) part then stops working: it only matches a repeat of the captured text, and the text inside the inner parentheses is different.
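The behaviour described above can be seen in a much smaller pattern. This is a minimal sketch, not the regex from the question; the group name ch and the variable names are hypothetical. In Python's re module, (?P=name) is a backreference: it matches the text the named group last captured, not the group's sub-pattern.

```python
import re

# (?P<ch>[ab]) captures one character; (?P=ch) must then see that
# same character again. It does NOT re-apply the sub-pattern [ab].
same_text = re.compile(r"(?P<ch>[ab])(?P=ch)")

repeated = same_text.fullmatch("aa")    # a match object: "a" followed by "a"
different = same_text.fullmatch("ab")   # None: "b" is not the captured "a"
```

So a backreference cannot express "another balanced substring", only "the same substring again".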

In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so:

import re

REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches
    # a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*"""

REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE | re.VERBOSE)

Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.

Recommended answer

You could write a little, very simple routine that does it, without using a regular expression:

  • Set a position counter pos so that it points to just before the opening bracket after your for or while.
  • Set an open brackets counter openBr to 0.
  • Now keep incrementing pos, reading the characters at the respective positions, and increment openBr when you see an opening bracket, and decrement it when you see a closing bracket. That will increment it once at the beginning, for the first opening bracket in "for (", increment and decrement some more for some brackets in between, and set it back to 0 when your for bracket closes.
  • So, stop when openBr is 0 again.

The stopping position is the closing bracket of your for(...). Now you can check whether a semicolon follows it or not.
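The steps above could be sketched in Python roughly as follows. The function name is_semicolon_terminated_loop and the small helper pattern _LOOP_HEAD are hypothetical, introduced only for illustration; the regex is used just to find the keyword and its opening bracket, while the bracket matching itself is done by counting. The sketch assumes the loop header sits on a single line and does not try to handle brackets inside string literals, character literals, or comments.

```python
import re

# Hypothetical helper: finds "for (" or "while (" at the start of a line.
_LOOP_HEAD = re.compile(r"\s*(?:for|while)\b\s*\(")


def is_semicolon_terminated_loop(line: str) -> bool:
    """Return True if line is a for/while header directly followed by ';'."""
    head = _LOOP_HEAD.match(line)
    if head is None:
        return False

    pos = head.end() - 1  # index of the opening bracket after for/while
    open_br = 0           # the openBr counter from the answer
    while pos < len(line):
        ch = line[pos]
        if ch == "(":
            open_br += 1
        elif ch == ")":
            open_br -= 1
            if open_br == 0:
                break     # pos is now the bracket that closes for(...)
        pos += 1
    else:
        return False      # reached end of line with brackets still open

    # Only whitespace may sit between the closing bracket and the semicolon.
    return line[pos + 1:].lstrip().startswith(";")
```

Called on the examples from the question, this returns True for `for (int i = 0; i < 10; doSomethingTo(i));` and False for `for (int i = 0; i < 10; i++)`.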
