PHP: How to remove nested tags and relocate them in a non-nested way?

2022-01-07 00:00:00 recursion nested tags php

I need to remove all occurrences of a BB-style tag from a string. The tags can be nested, and this is where I am failing. I also need to relocate each tag and its contents to the end of the string, and replace the tag with an HTML element. I have tried regex and preg_replace_callback, but so far without success. I also tried to adapt the answers to the following questions, with no luck: "Removing nested bbcode (quotes) in PHP" and "How can I remove an html element and it's contents using RegEx". I don't think I can use an HTML parser for this, because the HTML is malformed (children in elements that can't have children).

This is what the string looks like:

This is some 
[tag] attribute=1 attribute2=1 
     [tag] attribute=1 attribute2=1 [/tag] 
     [tag] attribute=1 attribute2=1 [/tag]
[/tag]
 text.

The result should look like this:

This is some text.
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>
<br attribute=1 attribute2=1>

Any help would be greatly appreciated.

Recommended answer

Street cred: I worked for Infopop (later known as Groupee, now Social Strata), the creators of UBBCode, the thing that was copied and transformed into just plain old regular "BBCode."

tl;dr: Time to write your own non-regex parser.

Most BBCode parsers use regexes, and that works for most cases, but you're doing something custom here. Plain old regular expressions are not going to help you. Regexes have two modes of operation that get in our way: we can either match everything between two tags in "greedy" mode, or in "not greedy" mode.

In "greedy" mode, we'll capture everything between the very first opening tag and the very last closing tag. This breaks things horribly. Take this case:

[a][b][c]...[/c][/b][/a]...[a]...[/a]

A greedy regex like \[a\].+\[/a\] (the brackets have to be escaped, or [a] is read as a character class) is going to grab everything from that first opening tag to that last closing tag, ignoring the fact that the closer isn't closing the opener.

The other option is worse. Take this case:

[a][b][a]...[/a][/b][/a]

An ungreedy regex like \[a\].+?\[/a\] (the only change from the greedy version is the question mark) is going to match the first opening tag, but then it'll stop at the first closing tag, again ignoring that the closing tag doesn't belong to that opening tag.
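Both failure modes are easy to reproduce. The snippet below is only an illustration: the sample strings fill in the "..." placeholders with made-up text, and the `s` modifier is added so `.` also crosses newlines.

```php
<?php
// Greedy: runs from the first [a] to the LAST [/a], swallowing " middle ".
$greedyInput = '[a][b][c]x[/c][/b][/a] middle [a]y[/a]';
preg_match('~\[a\].+\[/a\]~s', $greedyInput, $greedy);

// Ungreedy: stops at the FIRST [/a], which belongs to the inner [a].
$lazyInput = '[a][b][a]x[/a][/b][/a]';
preg_match('~\[a\].+?\[/a\]~s', $lazyInput, $lazy);

var_dump($greedy[0] === $greedyInput);   // the whole string matched
var_dump($lazy[0]);                      // "[a][b][a]x[/a]"
```

Neither match corresponds to a properly paired opener and closer, which is the whole problem.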

The way I solved this, way, way back in the primitive days, was to completely ignore the fact that the opening and closing tags didn't match. I simply looped the entire chain of tag transformation regexes until the output stopped changing. It was simple and effective, mainly because the available tag set was intentionally limited, so nesting was never an issue.
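A rough sketch of that loop-until-stable approach might look like the following. This is not the original UBBCode source; the function name `convertUntilStable`, the two rules, and the pass limit are assumptions made up for illustration.

```php
<?php
// Hypothetical helper: apply every rule repeatedly until nothing changes.
function convertUntilStable(string $text, array $rules, int $maxPasses = 20): string
{
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        $before = $text;
        foreach ($rules as $pattern => $replacement) {
            // preg_replace() returns null on error; keep the old text then.
            $text = preg_replace($pattern, $replacement, $text) ?? $text;
        }
        if ($text === $before) {
            break; // output stopped changing
        }
    }
    return $text;
}

// Illustrative rule chain: ungreedy matches, no attempt to pair tags.
$rules = [
    '~\[b\](.+?)\[/b\]~s' => '<b>$1</b>',
    '~\[i\](.+?)\[/i\]~s' => '<i>$1</i>',
];

echo convertUntilStable('[b]bold [i]both[/i][/b]', $rules), "\n";
```

The pass limit is only a safety net against a rule set that never settles.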

The instant you allow nesting of identical tags, blind, brute force is no longer a suitable tool.

If none of the BBCode parsing engines out there are going to work for you, you might have to write your own. Check all of them out. There are some on PEAR, there's a PECL extension, etc. Also check other languages for inspiration: Perl's CPAN has a dozen different implementations, some of which are very powerful and complex (if there isn't a proper recursive descent parser in that mix, I'll be depressed). This is a good challenge, but it's not too hard. Then again, I've written like five now (none of which I can release), so maybe I'm biased?

Start by exploding the string on [ and ]. Go through the resulting array, keeping track of when the element following an opening bracket and preceding the next closing bracket happens to look like a valid tag and/or attributes. You're going to need to think about what happens when an attribute can contain a bracket, or worse, when attributes are URLs that are bracket-heavy (like PHP array syntax). You'll also need to think about attributes in general, including how (if?) they are quoted, whether multiple attributes per tag are allowed (as in your example), and what to do with invalid attributes.
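One possible shape for that first pass is sketched below. It uses preg_split() rather than a literal explode() so the bracketed pieces survive as delimiters; the `tokenize` name, the `[type, value]` token shape, and the tag whitelist are all assumptions for illustration. Attributes and brackets nested inside brackets are deliberately not handled here, and those are exactly the cases that need the extra thought described above.

```php
<?php
// Hypothetical helper: split input into ['text'|'open'|'close', value] tokens.
function tokenize(string $input, array $knownTags): array
{
    // Capture each "[...]" run that contains no nested brackets.
    $parts = preg_split(
        '~(\[[^\[\]]*\])~',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $tokens = [];
    foreach ($parts as $part) {
        $isTag = preg_match('~^\[(/?)([a-z*]+)\]$~i', $part, $m)
            && in_array(strtolower($m[2]), $knownTags, true);
        if ($isTag) {
            $tokens[] = [$m[1] === '/' ? 'close' : 'open', strtolower($m[2])];
        } else {
            // Anything that does not look like a known tag stays plain text.
            $tokens[] = ['text', $part];
        }
    }
    return $tokens;
}

print_r(tokenize('a [b]c[/b] [not a tag]', ['b']));
```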

As you continue to process the string, you will also need to keep track of what tags are open, and in what order. You'll have to think about what tags are permitted inside other tags. You'll also have to deal with mis-nesting, like [a][b][/a][/b]. Your options will be either re-opening the inner tag after the outer closes, or closing the inner as soon as the outer does. Worse, different behavior might make sense depending on the situation. Worse-worse are wacky tags like [*] inside [list], which traditionally don't have a closing tag!
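As a minimal sketch of the second option (close the inner tag as soon as the outer one closes), a stack of open tag names is enough. The `rebalance` name and the `[type, value]` token shape are assumptions for illustration; stray closers are simply dropped, and tags that are never closed get closed at the end.

```php
<?php
// Hypothetical helper: repair mis-nested ['open'|'close'|'text', value] tokens.
function rebalance(array $tokens): array
{
    $stack = []; // names of currently open tags, outermost first
    $out = [];
    foreach ($tokens as [$type, $value]) {
        if ($type === 'text') {
            $out[] = [$type, $value];
        } elseif ($type === 'open') {
            $stack[] = $value;
            $out[] = [$type, $value];
        } else {
            // Find the innermost open tag with this name.
            $pos = -1;
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                if ($stack[$i] === $value) {
                    $pos = $i;
                    break;
                }
            }
            if ($pos === -1) {
                continue; // stray closer: nothing to close, drop it
            }
            // Close everything opened after it, then the tag itself.
            while (count($stack) > $pos) {
                $out[] = ['close', array_pop($stack)];
            }
        }
    }
    // Close whatever is still open at end of input.
    while ($stack) {
        $out[] = ['close', array_pop($stack)];
    }
    return $out;
}

// [a][b][/a][/b] becomes [a][b][/b][/a]; the trailing [/b] is dropped.
print_r(rebalance([['open', 'a'], ['open', 'b'], ['close', 'a'], ['close', 'b']]));
```

The re-opening strategy would push the popped inner tags back as fresh `open` tokens after the outer closer instead of discarding them.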

Once you've processed the string and have created a list of open and closing tags (and possibly re-balanced the opens and closes), then you can transform the result into HTML, or whatever your output ends up being. This is when and how you'd move the output of those specific tags to the end of the new document.
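For the specific input in the question, a very small non-regex-matching pass could look like this. It is a sketch under several assumptions, not a general BBCode engine: the tag is literally named `tag`, the text right after each opener (up to the next tag) is its attribute string, any other text inside a tag is discarded, and whitespace outside the tags is collapsed. The `relocateTags` name is made up.

```php
<?php
// Hypothetical helper: strip (possibly nested) [tag]...[/tag] blocks and
// append one <br ...> per opener at the end of the remaining text.
function relocateTags(string $input, string $tag = 'tag'): string
{
    $open = '[' . $tag . ']';
    $close = '[/' . $tag . ']';
    $quoted = preg_quote($tag, '~');
    $parts = preg_split("~(\\[/?{$quoted}\\])~", $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0;           // how many tags are currently open
    $outside = '';        // text that is not inside any tag
    $moved = [];          // attribute string of each opener, in order
    $expectAttrs = false; // true right after an opener

    foreach ($parts as $part) {
        if ($part === $open) {
            $depth++;
            $moved[] = '';
            $expectAttrs = true;
        } elseif ($part === $close) {
            if ($depth > 0) {
                $depth--; // stray closers are ignored
            }
            $expectAttrs = false;
        } elseif ($depth === 0) {
            $outside .= $part;
        } elseif ($expectAttrs) {
            $moved[count($moved) - 1] = trim(preg_replace('~\s+~', ' ', $part));
            $expectAttrs = false;
        }
        // Assumption: other text inside a tag is dropped.
    }

    $lines = [trim(preg_replace('~\s+~', ' ', $outside))];
    foreach ($moved as $attrs) {
        // Escaping only stops tag break-out; real code should whitelist attributes.
        $attrs = htmlspecialchars($attrs, ENT_QUOTES);
        $lines[] = $attrs === '' ? '<br>' : "<br $attrs>";
    }
    return implode("\n", $lines);
}

$input = "This is some \n[tag] attribute=1 attribute2=1 \n"
    . "     [tag] attribute=1 attribute2=1 [/tag] \n"
    . "     [tag] attribute=1 attribute2=1 [/tag]\n[/tag]\n text.";
echo relocateTags($input), "\n";
```

Because depth is tracked with a counter instead of a pattern match, nesting of identical tags no longer confuses the opener/closer pairing.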

Once you've finished up, write a thousand test cases. Try to break it, blow it into itty bitty chunks, produce XSS vulnerabilities, and otherwise do your best to make your life hell. It will be worth it, because the result will be a BBCode engine that will do what you're trying to do.
