Parsing blank XML tags with LXML and Python

2022-01-10 00:00:00 python xml parsing xml-parsing



When parsing XML documents in the format of:



carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Color'])  # Blue


This code will not work if a tag is empty, such as:



Using the same code as above:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Model'])  # KeyError


How would I parse this blank tag?


You're putting in a [text()] filter, which explicitly asks only for elements that have text nodes in them... and then you're unhappy when it doesn't give you elements without text nodes?


Leave that filter out (and select element children with * rather than node(), which would also match the whitespace text nodes between tags), and you'll get your Model element:

>>> import lxml.etree
>>> s = '''
... <root>
...   <Car>
...     <Color>Blue</Color>
...     <Make>Chevy</Make>
...     <Model/>
...   </Car>
... </root>'''
>>> e = lxml.etree.fromstring(s)
>>> carData = e.xpath('Car/*')
>>> carData
[<Element Color at 0x23a5460>, <Element Make at 0x23a54b0>, <Element Model at 0x23a5500>]
>>> {child.tag: child.text for child in carData}
{'Color': 'Blue', 'Make': 'Chevy', 'Model': None}
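
To make the effect of the predicate concrete, here is a minimal sketch that runs both queries against the same sample document. The variable names (`doc`, `with_filter`, `without_filter`, `car`) are just illustrative, and mapping `None` to an empty string is an assumption about what you might want for blank tags, not something the question requires.

```python
import lxml.etree

doc = lxml.etree.fromstring(
    "<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>"
)

# With the [text()] predicate, only elements that contain a text node match,
# so the empty <Model/> element is silently dropped.
with_filter = [el.tag for el in doc.xpath("Car/*[text()]")]

# Without the predicate, every child element of <Car> is returned.
without_filter = [el.tag for el in doc.xpath("Car/*")]

# An empty element has .text == None; here it is mapped to "" (an assumption
# about the desired result, adjust to taste).
car = {el.tag: el.text or "" for el in doc.xpath("Car/*")}

print(with_filter)     # ['Color', 'Make']
print(without_filter)  # ['Color', 'Make', 'Model']
print(car)             # {'Color': 'Blue', 'Make': 'Chevy', 'Model': ''}
```

If you would rather keep `None` to tell a blank tag apart from one that holds an empty-looking value, drop the `or ""`.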

That said, if your immediate goal is to iterate over the nodes in the tree, you might consider using lxml.etree.iterparse() instead, which lets you process elements as they are parsed rather than holding a full DOM tree in memory (provided you clear elements once you're done with them), and can be much more efficient than building a tree and then iterating over it with XPath. (Think SAX, but without the insane and painful API.)

Implementing with iterparse could look like this:

import io
import lxml.etree

def get_cars(infile):
    current_car = None  # None while we are outside any <Car> element
    for (event, element) in lxml.etree.iterparse(infile, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'Car':
                current_car = {}  # a new <Car> has just opened
            continue  # element.text is only reliable on 'end' events
        if current_car is None:
            continue
        if element.tag == 'Car':
            yield current_car
            current_car = None
            element.clear()  # free the finished subtree
        else:
            current_car[element.tag] = element.text

for car in get_cars(infile=io.BytesIO(b'''<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>''')):
    print(car)

...it's more code, but (if we weren't using an in-memory buffer for the example) it could process a file much larger than could fit in memory.
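
A shorter variant is possible if you only care about complete Car elements: lxml's iterparse() accepts a `tag` argument, so you can listen for 'end' events on Car alone and read its children at that point. This is a sketch, not the only way to do it; the function name `iter_cars` and the two-car sample document are made up for illustration.

```python
import io
import lxml.etree

def iter_cars(source):
    """Yield one dict per <Car>, clearing each element after use."""
    # Only 'end' events for <Car> are delivered; by then all of its
    # children have been parsed and their .text is populated.
    for _event, car in lxml.etree.iterparse(source, events=("end",), tag="Car"):
        yield {child.tag: child.text for child in car}
        car.clear()  # release the children we no longer need

# Hypothetical sample input with one blank tag in each car.
xml = (
    b"<root>"
    b"<Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car>"
    b"<Car><Color>Red</Color><Make/><Model>Civic</Model></Car>"
    b"</root>"
)

cars = list(iter_cars(io.BytesIO(xml)))
for car in cars:
    print(car)
```

Blank tags show up with a value of None, same as in the XPath version. For a real file, pass a filename or an open binary file object instead of the BytesIO buffer.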
