How can I make lxml's iterparse ignore invalid XML characters?

2022-04-01 00:00:00 python xml xml-parsing lxml

Question

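I have an XML file that contains invalid characters. lxml's XMLParser raises an exception on these characters, but when I create the XMLParser with the recover=True option, it ignores the bad characters and parses the file normally.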

My question is: how do I set a similar flag for lxml's iterparse function?

To reproduce:

Broken XML (/tmp/z.xml):

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <B>Bad characters:</B>
    </item>
</items>

Note: the string "Bad characters:" is followed by two ASCII characters #31 (0x1F), which I could not copy and paste here.
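
Because the two control characters do not survive copy and paste, one way to recreate the file is to write the bytes directly. This is a minimal sketch and not part of the original question: the helper name write_broken_xml is hypothetical, and the file goes into the system temp directory, which is assumed to be /tmp on most Linux systems.

```python
import os
import tempfile

# Same document as above, with two 0x1F bytes right after "Bad characters:".
BROKEN_XML = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<items>\n"
    b"    <item>\n"
    b"        <B>Bad characters:\x1f\x1f</B>\n"
    b"    </item>\n"
    b"</items>\n"
)


def write_broken_xml(path):
    """Write the broken document in binary mode so the bytes are kept as-is."""
    with open(path, "wb") as handle:
        handle.write(BROKEN_XML)
    return path


# Assumption: the temp directory resolves to /tmp, giving /tmp/z.xml.
path = write_broken_xml(os.path.join(tempfile.gettempdir(), "z.xml"))
```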

Parse error with XMLParser:

import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser()
tree   = etree.parse(fd, parser)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2576, in lxml.etree.parse (src/lxml/lxml.etree.c:22796)
  File "parser.pxi", line 1488, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60390)
  File "parser.pxi", line 1518, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:60687)
  File "parser.pxi", line 1401, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:59658)
  File "parser.pxi", line 991, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:57303)
  File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)
  File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21

To ignore the bad characters, I set recover=True, and parsing works:

import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser(recover=True)
tree   = etree.parse(fd, parser)
etree.tostring(tree)

# OUTPUT:
<items>
	<item>
		<B>Bad characters:</B>
	</item>
</items>'

With iterparse I get the same error again. How can I make it ignore the bad characters?

fd = open('/tmp/z.xml')
it = etree.iterparse(fd, events=("start", "end"))
for e in it: print(e)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21

Answer

iterparse also accepts a recover argument:

it = etree.iterparse(fd, events=("start", "end"), recover=True)

(Documentation: lxml iterparse)
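
For a check that does not depend on a file on disk, the sketch below feeds the same broken document to iterparse from memory. The constant BROKEN_XML and the helper closed_tags are illustrative names, not part of lxml. The exact text left in <B> after recovery may depend on the underlying libxml2 version, so the sketch only collects element tags.

```python
import io

import lxml.etree as etree

# Same shape as /tmp/z.xml from the question; \x1f is the invalid character.
BROKEN_XML = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<items>\n"
    b"    <item>\n"
    b"        <B>Bad characters:\x1f\x1f</B>\n"
    b"    </item>\n"
    b"</items>\n"
)


def closed_tags(xml_bytes, recover):
    """Return the tags of all elements for which an "end" event was reported."""
    source = io.BytesIO(xml_bytes)  # iterparse accepts a binary file-like object
    events = etree.iterparse(source, events=("end",), recover=recover)
    return [element.tag for _event, element in events]


# With recover=True the loop should finish instead of raising XMLSyntaxError.
print(closed_tags(BROKEN_XML, recover=True))
```

Calling closed_tags(BROKEN_XML, recover=False) should reproduce the XMLSyntaxError shown in the question.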
