Ultimate solution for removing nested HTML tags of the same type using regex?

2022-01-07 00:00:00 regex nested dom php html

I've spent days trying to find a solution WITH regex (before somebody says it: I know I should be using the PHP DOM Document library or something similar, but let's treat this as a theoretical question), looking up answers, and I finally came up with what I'll show near the end of this question.

What follows is just a summary of a lot of things I've tried before.

First of all, what I mean by nested tags of the same type is:

Text outside any div
<div id="my_id"> bla bla
  <div>
  bla bla bla
    <div style="some style here">
      lalalalala
     </div>
   </div>
    I'm trapped in a div!
</div>
more text outside divs

<div>more divs here!
       <div id="justbeingannoying">radiohead rules</div>
</div>

Now imagine I want to remove all the divs and their content using regex. So the intended result would be:

Text outside any div
more text outside divs

The first idea would be matching everything. The following regex matches div tags with properties (style, id, etc):

/<div[^>]*>.*<\/div>/sig

The problem, of course, is that this will match everything between the beginning of the first "<div" and the last "</div>", so it will match "more text outside divs" too (check here: https://regex101.com/r/iR8mY2/1 ), which is not what we (I) want.

This could be solved using the U modifier (Ungreedy):

/<div[^>]*>.*<\/div>/sigU

but then we'll have the opposite problem of matching less than we want: it will match only from the first "<div" to the first "</div>" (so, if we remove the matches, besides some unmatched tags, we'll be left with the text "I'm trapped in a div!", which we don't want).
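A quick way to see both failure modes is to run the two patterns over a small input. This is only a minimal sketch: the one-line sample string is made up for illustration, and `~` is used as the delimiter so that the slash in `</div>` needs no escaping. The `g` flag is a regex101 notion; `preg_replace` already replaces every match.

```php
<?php
// Hypothetical sample: two top-level divs, each containing a nested div.
$html = 'A<div id="x">b<div>c</div>d</div>E<div>f<div>g</div></div>';

// Greedy: .* runs to the LAST </div>, so the "E" between the two blocks is lost.
$greedy = preg_replace('~<div[^>]*>.*</div>~si', '', $html);

// Ungreedy (U modifier): each match stops at the FIRST </div>,
// leaving inner text and stray closing tags behind.
$lazy = preg_replace('~<div[^>]*>.*</div>~siU', '', $html);

echo $greedy, "\n"; // A
echo $lazy, "\n";   // Ad</div>E</div>
```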

So, I found a solution that works like a charm for nested parentheses, square brackets, etc.:

/\[([^\[\]]*+|(?R))*\]/si

Basically, this finds an opening square bracket, then matches anything that is neither an opening nor a closing square bracket OR a recursive instance of the whole pattern, and finally matches a closing square bracket.
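As a rough illustration of that recursive pattern in PHP (the sample string is invented for the example):

```php
<?php
// Hypothetical input with one nested and one flat bracket group.
$text = 'a[b[c]d]e[f]g';

// \[ ... \] with either non-bracket characters or the whole pattern again, via (?R).
$pattern = '/\[([^\[\]]*+|(?R))*\]/s';

$stripped = preg_replace($pattern, '', $text);
echo $stripped, "\n"; // aeg
```

Each top-level group, including everything nested inside it, is consumed as a single match.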

What I have working now is a bad solution: first I replace all the opening tags with an opening square bracket (a character that can't appear in my code, for other reasons), then I replace the closing tags with a closing square bracket, and then I use the previous regex. Not a very elegant solution, I know.
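That workaround might look roughly like the sketch below. The two tag-to-bracket replacements are assumptions (the question does not show them), and the approach only holds if the input contains no square brackets of its own.

```php
<?php
// Hypothetical sample input.
$html = 'A<div id="x">b<div>c</div>d</div>E<div>f<div>g</div></div>';

// Step 1: opening tags (with or without attributes) become "[".
$step1 = preg_replace('~<div\b[^>]*>~i', '[', $html);
// Step 2: closing tags become "]".
$step2 = str_ireplace('</div>', ']', $step1);
// Step 3: remove every balanced bracket group with the recursive pattern.
$result = preg_replace('/\[([^\[\]]*+|(?R))*\]/s', '', $step2);

echo $step2, "\n";  // A[b[c]d]E[f[g]]
echo $result, "\n"; // AE
```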

The thing is, I really want to know how this could be done with just one regex. It seems obvious that replacing the "[" and the "]" in the previous regex with the HTML tags has to work. But it is not that easy. The problem is that character negation ("[^.......]") doesn't work for strings like "div". It seems that something similar can be achieved with this:

.+?(?=<div>)

and, of course, the same for the closing tag:

.+?(?=</div>)

This is, more or less, how I arrived at this regex:

/<div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))*<\/div>/gis

It behaves exactly like the first regex I presented above: https://regex101.com/r/yU8pV3/1

So, here is my question: what is wrong with that regex?

Thank you!

Solution

DISCLAIMER

Since the question has met with a positive reaction, I will post an answer explaining what is wrong with your approach and showing how to match text that is not some specific text.

HOWEVER, I want to emphasize: Do not use this to parse real, arbitrary HTML code, as regex should only be used on plain text.

What is wrong with your regex

Your regex contains the part <div((.+?(?=<\/div>)|.+?(?=<div>))|(?R))* (equivalent to <div((.+?(?=<\/?div>))|(?R))*) before the closing <\/div>. When you have delimited text, do not rely on plain lazy/greedy dot matching (unless it is used inside an unroll-the-loop structure, and you know what you are doing). Here is what it does:

  • <div - matches <div literally (it also matches inside <diverse, because no word boundary or \s follows it)
  • ( - Group 1, which matches:
    • (.+?(?=<\/div>)|.+?(?=<div>)) - any 1+ chars (as few as possible) up to the first </div> or the first <div>
    • |
    • (?R) - recurses the whole pattern (i.e. inserts and applies it at this point)
  • )* - repeats Group 1 zero or more times.

The problem is clear: the (.+?(?=<\/?div>)) part does not exclude matching <div> or </div> itself, because .+? consumes at least one character before the lookahead is checked. This branch MUST match only text that is NOT EQUAL to the leading and trailing delimiters.
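The flaw can be seen in isolation with a small experiment. The sketch below anchors the text branch at the start of an invented sample; the anchor `^` and the sample string are assumptions added for the demonstration.

```php
<?php
// The lazy-dot text branch, anchored at the start of the subject for the demo.
$branch = '~^.+?(?=</?div>)~s';

// Hypothetical sample that begins with an opening delimiter.
preg_match($branch, '<div>x</div>', $m);

// .+? walks straight across the opening <div> before the lookahead stops it.
echo $m[0], "\n"; // <div>x
```

Because the branch swallows an opening tag as ordinary text, the recursion never gets a chance to pair that tag with its own closing tag.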

Solution(s)

To match text other than some specific text, use a tempered greedy token:

<div\b[^<]*>((?:(?!<\/?div\b).)+|(?R))*<\/div>\s*
             ^^^^^^^^^^^^^^^^^^^

See the regex demo. Note that you must use the DOTALL modifier to match text across newlines. The capturing group is redundant, so you can remove it.

What is important here is that (?:(?!<\/?div\b).)+ matches 1 or more characters, none of which is the starting character of a <div....> or </div sequence. See my thread linked above for how that works.
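A minimal PHP sketch of the tempered-greedy-token pattern follows. The sample string is hypothetical, the capturing group has been made non-capturing as suggested, and `~` delimiters are used so the slashes need no escaping.

```php
<?php
// Hypothetical sample; the newline inside the first div is why the s (DOTALL) flag matters.
$html = "A<div id=\"x\">b\n<div>c</div>d</div>E<div>f<div>g</div></div>";

// Text branch: any character that does not start "<div" or "</div" (as a whole word).
// Otherwise: recurse into a nested div with (?R).
$pattern = '~<div\b[^<]*>(?:(?:(?!</?div\b).)+|(?R))*</div>\s*~s';

$clean = preg_replace($pattern, '', $html);
echo $clean, "\n"; // AE
```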

As for performance, tempered greedy tokens are resource-consuming. The unroll-the-loop technique comes to the rescue:

<div\b[^<]*>(?:[^<]+(?:<(?!\/?div\b)[^<]*)*|(?R))*<\/div>\s*

See this regex demo

Now the token looks like [^<]+(?:<(?!\/?div\b)[^<]*)*: 1+ characters other than <, followed by 0+ sequences of a < that is not followed by /div or div (as a whole word) and then 0+ non-< characters.
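The unrolled pattern can be tried the same way. In this sketch the sample strings are invented, and a non-div tag (`<p>`) is included so the `<(?!/?div\b)` part is exercised.

```php
<?php
// Unrolled text branch: a run of non-"<" characters, then any "<" that does not start a div tag.
$pattern = '~<div\b[^<]*>(?:[^<]+(?:<(?!/?div\b)[^<]*)*|(?R))*</div>\s*~';

// Hypothetical sample with a <p> element and a nested div.
$html = 'A<div id="x">b<p>c</p><div>d</div>e</div>F';
$clean = preg_replace($pattern, '', $html);
echo $clean, "\n"; // AF

// Hypothetical edge case: the div content starts directly with a non-div tag.
$edge = preg_replace($pattern, '', '<div><p>x</p></div>');
echo $edge, "\n"; // <div><p>x</p></div> (left untouched)
```

One thing the second call suggests: as written, the text branch has to begin with `[^<]+`, so a div whose content starts directly with another tag is not matched. Making the leading run optional would change the pattern, so treat this as an observation about the sketch rather than a fix.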

<div\b might still match in <div-tmp, so perhaps <div(?:\s|>) is a better way to deal with this via regex. Still, parsing HTML with a DOM parser is much easier.
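The difference between the two opening-tag checks can be compared directly. The tag names below are made up purely for the comparison.

```php
<?php
// Hypothetical tags used only to compare the two opening-tag patterns.
$samples = ['<div>', '<div id="x">', '<div-tmp>', '<diverse>'];

$results = [];
foreach ($samples as $tag) {
    $results[$tag] = [
        'boundary' => preg_match('~<div\b~', $tag),       // word boundary after "div"
        'strict'   => preg_match('~<div(?:\s|>)~', $tag), // whitespace or ">" after "div"
    ];
}

print_r($results);
```

Both checks accept `<div>` and `<div id="x">` and reject `<diverse>`, but only the `(?:\s|>)` form rejects `<div-tmp>`, since a hyphen still counts as a word boundary.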
