PHP "pretty print" HTML (not Tidy)

2022-01-15 00:00:00 format php html tidy

I'm using the DOM extension in PHP to build some HTML documents, and I want the output to be formatted nicely (with new lines and indentation) so that it's readable. However, from the many tests I've done:

  1. "formatOutput = true" doesn't work at all with saveHTML(), only saveXML()
  2. Even if I used saveXML(), it still only works on elements created via the DOM, not elements that are included with loadHTML(), even with "preserveWhiteSpace = false"

If anyone knows differently I'd really like to know how they got it to work.

So, I have a DOM document, and I'm using saveHTML() to output the HTML. As it's coming from the DOM I know it is valid, so there's no need to "Tidy" or validate it in any way.

I'm simply looking for a way to get nicely formatted output from the output I receive from the DOM extension.

NB. As you may have guessed, I don't want to use the Tidy extension as a) it does a lot more than I need it to (the markup is already valid) and b) it actually makes changes to the HTML content (such as the HTML 5 doctype and some elements).

Follow Up:

OK, with the help of the answer below I've worked out why the DOM extension wasn't working. Although the given example works, it still wasn't working with my code. With the help of this comment I found that if you have any text nodes where isWhitespaceInElementContent() is true, no formatting will be applied beyond that point. This happens regardless of whether preserveWhiteSpace is false. The solution is to remove all of these nodes (although I'm not sure whether this has adverse effects on the actual content).
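
A minimal sketch of that cleanup, under some assumptions: the helper name stripWhitespaceNodes() is made up for illustration, the XPath test treats only XML whitespace as blank, and whitespace inside pre and textarea is left alone because it is usually significant there. It uses saveXML() because that is the serializer reported above to honour formatOutput.

```php
<?php
// Hypothetical helper, not part of the original post: removes text nodes that
// contain only whitespace so the serializer is free to re-indent the tree.
function stripWhitespaceNodes(DOMDocument $dom): int
{
    $xpath = new DOMXPath($dom);
    // Assumption: whitespace inside <pre> and <textarea> is significant, so skip it.
    $query = '//text()[normalize-space(.) = ""]'
           . '[not(ancestor::pre) and not(ancestor::textarea)]';
    $removed = 0;
    // Copy to an array first so removals cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
        $removed++;
    }
    return $removed;
}

$dom = new DOMDocument();
$dom->loadHTML(
    "<html><head><title>foo bar</title></head>\n" .
    "<body>\n<h1>bar foo</h1>  <p>apples and oranges</p>\n</body></html>"
);
stripWhitespaceNodes($dom);
$dom->formatOutput = true;
// Serialize only the root element with the XML serializer.
echo $dom->saveXML($dom->documentElement), "\n";
```

Whether dropping these nodes is safe depends on the markup: between inline elements a single space can be meaningful, so this is best limited to documents where such whitespace is purely layout.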

Solution

You're right, saveHTML() doesn't seem to indent its output (others are also confused). saveXML() works, even with loaded code.

<?php
function tidyHTML($buffer) {
    // load our document into a DOM object
    $dom = new DOMDocument();
    // we want nice output
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($buffer);
    $dom->formatOutput = true;
    return($dom->saveHTML());
}

// start output buffering, using our nice
// callback function to format the output.
ob_start("tidyHTML");

?>
<html>
    <head>
    <title>foo bar</title><meta name="bar" value="foo"><body><h1>bar foo</h1><p>It's like comparing apples to oranges.</p></body></html>
<?php
// this will be called implicitly, but we'll
// call it manually to illustrate the point.
ob_end_flush();
?>

result:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>foo bar</title>
<meta name="bar" value="foo">
</head>
<body>
<h1>bar foo</h1>
<p>It's like comparing apples to oranges.</p>
</body>
</html>

the same with saveXML() ...

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title>foo bar</title>
    <meta name="bar" value="foo"/>
  </head>
  <body>
    <h1>bar foo</h1>
    <p>It's like comparing apples to oranges.</p>
  </body>
</html>

Maybe you forgot to set preserveWhiteSpace = false before calling loadHTML()?

Disclaimer: I stole most of the demo code from Tyson Clugg's comment in the PHP manual. Lazy me.


UPDATE: I now remember that some years ago I tried the same thing and ran into the same problem. I fixed it with a dirty workaround (it wasn't performance critical): I converted back and forth between SimpleXML and DOM until the problem vanished. I suppose the conversion got rid of those nodes. Probably: load with DOM, import with simplexml_import_dom, output the string, parse it with DOM again and then pretty-print it. As far as I remember this worked (but it was really slow).
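
A rough sketch of what that round trip could look like, reconstructed from memory rather than from the original code. The function name roundTripFormat() is hypothetical, and whether the second parse really drops every blank node depends on libxml's whitespace heuristics, so treat it as something to try, not a guarantee.

```php
<?php
// Hypothetical helper sketching the DOM -> SimpleXML -> string -> DOM round trip.
function roundTripFormat(string $html): string
{
    $loaded = new DOMDocument();
    $loaded->loadHTML($html);

    // DOM -> SimpleXML -> XML string
    $sx = simplexml_import_dom($loaded);
    if ($sx === null) {
        throw new RuntimeException('simplexml_import_dom() failed');
    }
    $xml = $sx->asXML();

    // XML string -> fresh DOM, asking the parser to drop ignorable whitespace
    $clean = new DOMDocument();
    $clean->preserveWhiteSpace = false; // must be set before loading
    $clean->loadXML($xml);
    $clean->formatOutput = true;
    return $clean->saveXML();
}

$html = "<html><head><title>foo bar</title></head><body>\n"
      . "<h1>bar foo</h1>\n<p>apples and oranges</p>\n</body></html>";
echo roundTripFormat($html);
```

Note that the result is XML, not HTML: it starts with an XML declaration and empty elements come out self-closed, so it suits inspection better than serving to browsers.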
