How to get node content from JDOM

2022-01-10 00:00:00 xml xml-parsing java jdom

I'm writing a Java application that uses JDOM (import org.jdom.*;).

My XML is valid, but sometimes it contains HTML tags. For example, something like this:

    <program-title>Anatomy & Physiology</program-title>
    <overview>
        <content>
            For more info click <a href="page.html">here</a>
            Learn more about the human body. Choose from a variety of Physiology (A&P) designed for complementary therapies.&#160; Online studies options are available.
        </content>
    </overview>
    <key-information>
        <category>Health & Human Services</category>

So my problem is with the tags inside the overview.content node.

I was hoping that this code would work:

    Element overview = sds.getChild("overview");
    Element content = overview.getChild("content");
    System.out.println(content.getText());

But it returns blank.

How do I return all the text (nested tags and all) from the overview.content node?

Thanks

Recommended answer

content.getText() returns only the element's immediate text, so it is really only useful for leaf elements that contain nothing but text.
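To make the "immediate text" distinction concrete, here is a minimal sketch. It deliberately uses the JDK's built-in org.w3c.dom API instead of JDOM so that it runs without extra dependencies; the class name, the helper methods, and the shortened sample string are assumptions made for illustration, not part of the original question. The idea should carry over to JDOM, where Element.getValue() plays a role similar to getTextContent() below.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MixedContentDemo {

    // Hypothetical sample, shortened from the question's <content> element.
    static final String XML =
        "<content>For more info click <a href=\"page.html\">here</a> Learn more.</content>";

    // Parses the string and returns its root element (JDK DOM, not JDOM).
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Text held directly by the root element; text nested inside child
    // elements (such as <a>) is skipped.
    static String immediateText(String xml) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parseRoot(xml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    || child.getNodeType() == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Text of the root element and all of its descendants, tags dropped.
    static String allText(String xml) {
        return parseRoot(xml).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("[" + immediateText(XML) + "]");
        System.out.println("[" + allText(XML) + "]");
    }
}
```

The first line printed omits the word "here" because it belongs to the nested a element; the second includes it but drops the tags. Neither keeps the markup itself, which is why a serializer is needed when the nested tags must be preserved.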

The trick is to use org.jdom.output.XMLOutputter (with the compact format, Format.getCompactFormat()):

    import java.io.StringWriter;

    import org.jdom.Document;
    import org.jdom.Element;
    import org.jdom.input.SAXBuilder;
    import org.jdom.output.Format;
    import org.jdom.output.XMLOutputter;

    public class Main {
        public static void main(String[] args) throws Exception {
            // Parse the XML file into a JDOM document.
            SAXBuilder builder = new SAXBuilder();
            String xmlFileName = "a.xml";
            Document doc = builder.build(xmlFileName);

            Element root = doc.getRootElement();
            Element overview = root.getChild("overview");
            Element content = overview.getChild("content");

            XMLOutputter outp = new XMLOutputter();
            outp.setFormat(Format.getCompactFormat());
            // Alternative format settings to experiment with:
            //outp.setFormat(Format.getRawFormat());
            //outp.setFormat(Format.getPrettyFormat());
            //outp.getFormat().setTextMode(Format.TextMode.PRESERVE);

            // Serialize the children of <content>, nested tags included.
            StringWriter sw = new StringWriter();
            outp.output(content.getContent(), sw);
            System.out.println(sw.toString());
        }
    }

Output

For more info click<a href="page.html">here</a>Learn more about the human body. Choose from a variety of Physiology (A&P) designed for complementary therapies.&#160; Online studies options are available.

Explore the other formatting options and adapt the code above to your needs.

From the Javadoc for org.jdom.output.Format:

"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization)."
