How to get node content from JDOM

2022-01-10 00:00:00 xml xml-parsing java jdom

I'm writing a Java application using import org.jdom.*;

My XML is valid, but sometimes it contains HTML tags. For example, something like this:

  <program-title>Anatomy &amp; Physiology</program-title>
  <overview>
       <content>
              For more info click <a href="page.html">here</a>
              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>
       </content>
  </overview>
  <key-information>
     <category>Health &amp; Human Services</category>

So my problem is with the <p> tags inside the overview.content node.

I was hoping that this code would work:

        Element overview = sds.getChild("overview");
        Element content = overview.getChild("content");

        System.out.println(content.getText());

But it returns blank.

How do I return all the text (nested tags and all) from the overview.content node?

Thanks

Recommended answer

content.getText() returns only the element's immediate text, so it is really only useful for leaf elements that contain nothing but text.

The trick is to use org.jdom.output.XMLOutputter (with the compact format, Format.getCompactFormat()) and serialize the element's content:

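import java.io.StringWriter;

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

// The class name is arbitrary; it only wraps main() so the example compiles.
public class ContentDump {
    public static void main(String[] args) throws Exception {
        // Parse the XML file into a JDOM document.
        SAXBuilder builder = new SAXBuilder();
        String xmlFileName = "a.xml";
        Document doc = builder.build(xmlFileName);

        // Navigate to overview/content below the root element.
        Element root = doc.getRootElement();
        Element overview = root.getChild("overview");
        Element content = overview.getChild("content");

        XMLOutputter outp = new XMLOutputter();

        // Compact format normalizes whitespace; the alternatives are left commented out.
        outp.setFormat(Format.getCompactFormat());
        //outp.setFormat(Format.getRawFormat());
        //outp.setFormat(Format.getPrettyFormat());
        //outp.getFormat().setTextMode(Format.TextMode.PRESERVE);

        // Serialize the children of <content> (text and nested elements), not the element itself.
        StringWriter sw = new StringWriter();
        outp.output(content.getContent(), sw);
        System.out.println(sw.toString());
    }
}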

Output

For more info click<a href="page.html">here</a><p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>

Explore the other formatting options and adapt the code above to your needs.

From the Format class documentation:

"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization)."

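For comparison, here is a minimal sketch of the same idea (serialize the children of the element instead of asking for its text) using only the JDK's built-in DOM and Transformer APIs rather than JDOM, so it runs without extra dependencies. The class name InnerXmlSketch and the helpers parseRoot and innerXml are hypothetical names made up for this illustration, and the sample XML is a shortened version of the one in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnerXmlSketch {

    // Hypothetical helper: parses an XML string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    // Hypothetical helper: serializes every child node of parent
    // (text nodes and nested elements alike) into a single string.
    static String innerXml(Element parent) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Without this, each serialized fragment could start with an XML declaration.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter sw = new StringWriter();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            transformer.transform(new DOMSource(children.item(i)), new StreamResult(sw));
        }
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<content>For more info click <a href=\"page.html\">here</a>"
                + "<p>Learn more about the human body.</p></content>";
        Element content = parseRoot(xml);
        // Prints the text and the nested <a> and <p> markup, without the <content> wrapper.
        System.out.println(innerXml(content));
    }
}
```

Unlike the compact JDOM format, this sketch does not normalize whitespace; it writes the child nodes as they were parsed.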