PHP RegExp for Nested Div Tags

2022-01-07 00:00:00 regex nested tags php


I need a regexp I can use with PHP's preg_match_all() to extract the content inside div tags. The divs look like this:

<div id="t1">Content</div>

So far I've come up with this regexp, which matches all divs with id="t[number]":

/<div id="t(\d)">(.*?)<\/div>/

The problem arises when the content itself contains more divs, i.e. nested divs like this:

<div id="t1">Content <div>more stuff</div></div>
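
A minimal reproduction of the problem, as a sketch: the pattern is the one shown above, while the variable names are made up for illustration. The lazy `(.*?)` stops at the first closing tag it meets, so the outer div is cut short.

```php
<?php
// Hypothetical variable names; the pattern is the one from the question.
$html = '<div id="t1">Content <div>more stuff</div></div>';
preg_match('/<div id="t(\d)">(.*?)<\/div>/', $html, $m);

// The lazy (.*?) ends at the FIRST </div>, which belongs to the inner div.
$matched  = $m[0]; // <div id="t1">Content <div>more stuff</div>
$captured = $m[2]; // Content <div>more stuff
```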

Any ideas on how I make my regexp work with nested tags?

Thanks

Solution

Try a parser instead:

require_once "simple_html_dom.php";
$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz  <div id="t2">yes</div>';
$html = str_get_html($text);
foreach($html->find('div') as $e) {
    if(isset($e->attr['id']) && preg_match('/^t\d++/', $e->attr['id'])) {
        echo $e->outertext . "\n";
    }
}

Output:

<div id="t1">Content <div>more stuff</div></div>
<div id="t2">yes</div>

Download the parser here: http://simplehtmldom.sourceforge.net/

Edit: More for my own amusement, I tried to do it with a regex. Here's what I came up with:

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div>
      baz <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>';
if(preg_match_all('#<div\s+id="t\d+">[^<>]*(<div[^>]*>(?:[^<>]*|(?1))*</div>)[^<>]*</div>#si', $text, $matches)) {
    print_r($matches[0]);
}

Output:

Array
(
    [0] => <div id="t1">Content <div>more stuff</div></div>
    [1] => <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>
)

And a small explanation:

<div\s+id="t\d+"> # match an opening 'div' with an id that starts with 't' and some digits
[^<>]*             # match zero or more chars other than '<' and '>'
(                  # open group 1
  <div[^>]*>       #   match an opening 'div'
  (?:              #   open a non-capturing group
    [^<>]*         #     match zero or more chars other than '<' and '>'
    |              #     OR
    (?1)           #     recursively match what is defined by group 1
  )*               #   close the non-capturing group and repeat it zero or more times
  </div>           #   match a closing 'div'
)                  # close group 1
[^<>]*             # match zero or more chars other than '<' and '>'
</div>             # match a closing 'div'
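
Note that the pattern as explained requires exactly one nested div directly inside the outer one, so a plain `<div id="t2">yes</div>` would not match. One possible variation, shown here only as a sketch, moves the recursion onto the div's content so that zero or more nested divs are accepted. The function name `matchNumberedDivs` is hypothetical, and the pattern still assumes well-formed markup with no tags other than `div` inside.

```php
<?php
// Hypothetical helper name; the pattern is a variation on the one explained above.
function matchNumberedDivs(string $html): array
{
    // "~" is the delimiter because "#" starts a comment under the x flag.
    $pattern = '~
        <div\s+id="t\d+">     # opening div whose id is t followed by digits
        (                     # group 1: the content of a div
          (?:
            [^<>]++           #   plain text without angle brackets
            |
            <div\b[^>]*>      #   a nested opening div
            (?1)              #   its content, matched recursively
            </div>            #   its closing tag
          )*
        )
        </div>                # closing tag of the outer div
    ~ix';

    preg_match_all($pattern, $html, $matches);
    return $matches[0]; // full matches only
}

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz <div id="t2">yes</div>';
$found = matchNumberedDivs($text);
// $found[0] is <div id="t1">Content <div>more stuff</div></div>
// $found[1] is <div id="t2">yes</div>
```

The possessive `[^<>]++` keeps the engine from retrying different splits of the same text run, which matters once the alternation sits inside a repeated group.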

Now perhaps you understand why people try to dissuade you from using regex for this. As already noted, it will not help if the HTML is malformed: the regex will make a bigger mess of the output than an HTML parser, I assure you. Also, the regex will probably make your eyes bleed, and your colleagues (or the people who will maintain your software) may come looking for you after seeing what you did. :)

Your best bet is to first clean up your input (using TIDY or similar), and then use a parser to get the info you want.
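
If adding a third-party library is not an option, the same idea can be sketched with PHP's bundled `DOMDocument` and `DOMXPath` classes, swapped in here for simple_html_dom. This assumes the DOM extension is enabled. The function name `findNumberedDivs` is hypothetical, and libxml's own error recovery stands in for a separate clean-up step.

```php
<?php
// Hypothetical helper name. Assumes PHP's bundled DOM extension is enabled.
function findNumberedDivs(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml's complaints about sloppy markup instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $divs = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@id]') as $div) {
        $id = $div->getAttribute('id');
        if (preg_match('/^t\d+$/', $id)) {
            // saveHTML($node) serializes the element together with its children.
            $divs[$id] = $doc->saveHTML($div);
        }
    }
    return $divs; // maps each matching id to the div's outer HTML
}

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz <div id="t2">yes</div>';
$divs = findNumberedDivs($text);
```

Because the parser tracks nesting itself, the only regex left is the small one that checks the id. The serializer may insert line breaks of its own, so the returned strings are not guaranteed to match the input byte for byte.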
