Extracting text nodes from XML files using a SAX parser in Java

2022-01-10 00:00:00 xml parsing xml-parsing java sax

I am currently using SAX to extract information from a number of XML documents I am working with. So far, extracting attribute values has been easy. However, I have no clue how to extract the actual value of a text node.

For example, in the given XML document:

          <w:rStyle w:val="Highlight" />
        </w:rPr>
      </w:pPr>
      <w:r>
        <w:t>Text to Extract</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00B41602" w:rsidRDefault="00B41602" w:rsidP="007C3A42">
      <w:pPr>
        <w:pStyle w:val="Copy" />

I can extract "Highlight" with no problem by reading the value of the val attribute. But I have no idea how to get into that text node and pull out "Text to Extract".

Here is my Java code so far for extracting attribute values:

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Nested inside the class that runs the parser.
    private static final class SaxHandler extends DefaultHandler {

        // Invoked when document parsing starts.
        @Override
        public void startDocument() throws SAXException {
            System.out.println("Document processing starting:");
        }

        // Invoked when document parsing finishes.
        @Override
        public void endDocument() throws SAXException {
            System.out.println("Document processing finished. ");
        }

        // Invoked when the parser enters element 'qName'.
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
            if (qName.equalsIgnoreCase("Relationships")) {
                // do nothing
            } else if (qName.equalsIgnoreCase("Relationship")) {
                // Read the "Target" attribute...
                String val = attrs.getValue("Target");
                // ...and if it is present...
                if (val != null) {
                    // ...and contains "image"...
                    if (val.contains("image")) {
                        // ...then get the "Id" attribute...
                        String id = attrs.getValue("Id");
                        // ...and use substring to isolate and print only the image name and number.
                        int begIndex = val.lastIndexOf("/");
                        int endIndex = val.lastIndexOf(".");
                        System.out.println("Id: " + id + " & Target: " + val.substring(begIndex + 1, endIndex));
                    }
                }
            } else {
                throw new IllegalArgumentException("Element '" + qName + "' is not allowed here");
            }
        }

        // Invoked when the parser leaves element 'qName'; no action is taken.
        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            // do nothing
        }
    }

But I have no clue where to start to get into that text node and pull out the values inside. Does anyone have some ideas?

Recommended answer

Here is a rough sketch of the approach:

    // Fields and overrides to add to a DefaultHandler subclass.
    private boolean insideElementContainingTextNode;
    private StringBuilder textBuilder;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // qName is the prefixed name ("w:t"). With a namespace-aware parser,
        // compare localName against "t" instead.
        if ("w:t".equals(qName)) {
            insideElementContainingTextNode = true;
            textBuilder = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append each one.
        if (insideElementContainingTextNode) {
            textBuilder.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("w:t".equals(qName)) {
            insideElementContainingTextNode = false;
            // The full text of this <w:t> element is now available; use it here.
            String theCompleteText = textBuilder.toString();
            textBuilder = null;
        }
    }
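To try the idea end to end, the following is a minimal, self-contained sketch rather than a definitive implementation. It assumes a namespace-aware parser, so it matches on the namespace URI and localName instead of the "w:t" qName. The class name WtTextExtractor, the method extractText, and the sample document in main are hypothetical. The namespace URI is the standard WordprocessingML one, which is an assumption here because the fragment in the question does not show its xmlns:w declaration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WtTextExtractor {

    // Assumed namespace bound to the "w" prefix (not shown in the question's fragment).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Collects the text content of every w:t element, in document order. */
    static final class TextHandler extends DefaultHandler {
        final List<String> texts = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a w:t element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // One text node may arrive in several chunks; append them all.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && W_NS.equals(uri) && "t".equals(localName)) {
                texts.add(current.toString());
                current = null;
            }
        }
    }

    /** Parses the given XML string and returns the text of each w:t element. */
    static List<String> extractText(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // populate uri and localName in callbacks
            SAXParser parser = factory.newSAXParser();
            TextHandler handler = new TextHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:pPr><w:rPr><w:rStyle w:val=\"Highlight\" /></w:rPr></w:pPr>"
                + "<w:r><w:t>Text to Extract</w:t></w:r>"
                + "</w:p>";
        System.out.println(extractText(xml)); // prints [Text to Extract]
    }
}
```

If the parser is left at the JAXP default (not namespace-aware), localName may be empty, and comparing qName against "w:t" as in the sketch above is the option that works.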
