Nested PCRE regex problem

2022-01-07 00:00:00 regex nested php

I have a custom template engine.

It matches inline tags like these:

@function(argument1 argument2 ...)
@get(param:name)
@get(param:@get(sub:name))

And block tags like this:

@function(argument1 argument2 ...)

    Some stuff @with(nested:tag)

    @foreach(arguments as value)
        More stuff : @get(value)
    @/foreach

    @function(other:args)
        Same function name (nested)
    @/function

@/function

With this pattern (PCRE / PHP):

#
@ ([\w]+) \(
( (?: [^@)] | (?R) )+ )
\)
(?:
    ( (?> (?-2) ) )
    @/\1
)?
#xms

This regex catches almost all cases. But when I have more tags, nested or not, it matches nothing. For example, with two nested @foreach(var:name) ... @/foreach blocks, the regex fails depending on the whitespace in the tag content.

Recommended answer

Named subpatterns are sometimes clearer. I suggest using this:

~
@(?<com>\w+)                     # command name
\s*                              # optional whitespace before the arguments
(?: \( (?<args>[^)]*) \) )?+     # optional parameters
(?:
    (?<content>(?:[^@]+|(?R))*+) # content (may be empty)
    @/\g{com}                    # close the command
)?+                              # the block form is optional
~x
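As a rough illustration of how this pattern could be applied, the sketch below runs it with preg_match on a small template modelled on the question's example. The sample template and the variable names are assumptions made for the demo. The x modifier is assumed so that the whitespace and comments inside the pattern are ignored.

```php
<?php
// Hypothetical sample input, modelled on the template in the question.
$template = <<<'TPL'
@foreach(items as item)
    Value: @get(item)
@/foreach
TPL;

// The pattern from the answer; the x modifier allows whitespace and comments.
$pattern = <<<'EOD'
~
@(?<com>\w+)                      # command name
\s*                               # optional whitespace before the arguments
(?: \( (?<args>[^)]*) \) )?+      # optional argument list
(?:
    (?<content>(?:[^@]+|(?R))*+)  # content, possibly empty
    @/\g{com}                     # closing tag of the same command
)?+                               # the block form is optional
~x
EOD;

if (preg_match($pattern, $template, $m) === 1) {
    echo "command: {$m['com']}\n";
    echo "args:    {$m['args']}\n";
    echo "content: ", trim($m['content']), "\n";
}
```

The nested @get(item) is consumed by the recursive (?R) call, so the outer match should span the whole @foreach ... @/foreach block, with the inner tag left inside the content capture.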

If you need to allow commands inside arguments, you can replace (?<args>[^)]*) with (?<args>(?:[^@)]+|(?=@)(?R))*+)
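To see what that substitution changes, here is a small sketch using the nested example from the question, @get(param:@get(sub:name)). The pattern is the one above with only the args group replaced and the layout compacted. The variable names are illustrative.

```php
<?php
// Same pattern as above, with the args group allowing nested commands.
$pattern = <<<'EOD'
~
@(?<com>\w+) \s*
(?: \( (?<args>(?:[^@)]+|(?=@)(?R))*+) \) )?+   # arguments may contain commands
(?: (?<content>(?:[^@]+|(?R))*+) @/\g{com} )?+  # optional block form
~x
EOD;

$input = '@get(param:@get(sub:name))';   // example taken from the question

preg_match($pattern, $input, $m);
echo $m['args'], "\n";   // the whole nested argument is captured
```

With the simpler [^)]* version, the args capture would stop at the first closing parenthesis, which here belongs to the inner @get.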

But when you are trying to describe a language, a better approach is to use the (?(DEFINE)...) syntax to describe the elements first, before the main pattern. For example:

$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<command_name> \w+ )
    (?<inline_command> @ \g<command_name> \s* \g<params>? )
    (?<multil_command> @ (\g<command_name>) \s* \g<params>? \g<content> @/ \g{-1} )
    (?<command> \g<multil_command> | \g<inline_command> )

    (?<other> [^@()]+ )
    (?<param> \g<other> | \g<command> )
    (?<params> \( \s* \g<param> (?: \s+ \g<param> )* \s* \) )

    (?<content> (?: \g<other> | \g<command> )* )
)
# main pattern
\g<command>
~x
EOD;

With this kind of syntax, if you want to extract elements at the ground level, you only need to change the main pattern to: @(?<com> \g<command_name> ) \s* (?<args> \g<params> )? (?: (?<con> \g<content> ) @/ \g{com} )? (NB: to obtain the other levels, put it inside a lookahead.)
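A minimal sketch of that ground-level extraction, assuming the definitions from the answer and a made-up two-command template, could look like this. The template, the @set command and the variable names are hypothetical.

```php
<?php
// Definitions from the answer, followed by the ground-level main pattern.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<command_name> \w+ )
    (?<inline_command> @ \g<command_name> \s* \g<params>? )
    (?<multil_command> @ (\g<command_name>) \s* \g<params>? \g<content> @/ \g{-1} )
    (?<command> \g<multil_command> | \g<inline_command> )

    (?<other> [^@()]+ )
    (?<param> \g<other> | \g<command> )
    (?<params> \( \s* \g<param> (?: \s+ \g<param> )* \s* \) )

    (?<content> (?: \g<other> | \g<command> )* )
)
# main pattern: top-level commands only
@ (?<com> \g<command_name> ) \s* (?<args> \g<params> )?
(?: (?<con> \g<content> ) @/ \g{com} )?
~x
EOD;

// Hypothetical sample template with two top-level commands.
$template = <<<'TPL'
@set(title)
@foreach(items as item)
    Value: @get(item)
@/foreach
TPL;

preg_match_all($pattern, $template, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // 'con' may be missing for inline commands, so default it to ''.
    printf("%s %s [%s]\n", $m['com'], $m['args'] ?? '', trim($m['con'] ?? ''));
}
```

On this input the sketch should report two top-level commands, set and foreach, while the nested @get stays inside the con capture of foreach. Note that args includes the surrounding parentheses here, because the params definition matches them.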
