Parsing blank XML tags with LXML and Python

2022-01-10 00:00:00 python xml parsing xml-parsing

Problem description

When parsing XML documents in the format of:

    <Car>
        <Color>Blue</Color>
        <Make>Chevy</Make>
        <Model>Camaro</Model>
    </Car>

I use the following code:

    carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
    parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
    print(parsedCarData[0]['Color'])  # Blue

This code will not work if a tag is empty, such as:

    <Car>
        <Color>Blue</Color>
        <Make>Chevy</Make>
        <Model/>
    </Car>

Using the same code as above:

    carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
    parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
    print(parsedCarData[0]['Model'])  # KeyError

How would I parse this blank tag?

Solution

You're putting in a [text()] filter, which explicitly asks only for elements that have text nodes in them... and then you're unhappy when it doesn't give you elements without text nodes?

Leave that filter out, and you'll get your model element:

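    >>> import lxml.etree
    >>> s = '''
    ... <root>
    ...   <Car>
    ...     <Color>Blue</Color>
    ...     <Make>Chevy</Make>
    ...     <Model/>
    ...   </Car>
    ... </root>'''
    >>> e = lxml.etree.fromstring(s)
    >>> carData = e.xpath('Car/*')
    >>> carData
    [<Element Color at 0x23a5460>, <Element Make at 0x23a54b0>, <Element Model at 0x23a5500>]
    >>> dict((child.tag, child.text) for child in carData)
    {'Color': 'Blue', 'Make': 'Chevy', 'Model': None}

Note that the query uses Car/* rather than Car/node(): without the [text()] predicate, node() would also return the whitespace text nodes between the tags, and those plain strings have no .tag attribute.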
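To make the effect of the filter concrete, the following minimal sketch runs three XPath expressions against the same sample document and lists what each one selects. The `describe` helper and the variable names are illustrative only and are not part of the original answer.

```python
import lxml.etree

xml = """
<root>
  <Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model/>
  </Car>
</root>"""

root = lxml.etree.fromstring(xml)


def describe(nodes):
    """Hypothetical helper: report elements by tag, keep text nodes as strings."""
    # lxml returns XPath text nodes as str subclasses, elements as Element objects.
    return [n if isinstance(n, str) else n.tag for n in nodes]


# Only children that contain a text node: the empty <Model/> is dropped.
with_text_filter = describe(root.xpath('Car/node()[text()]'))

# Every child node, including the whitespace between the tags.
all_nodes = describe(root.xpath('Car/node()'))

# Element children only, whether or not they contain text.
elements_only = describe(root.xpath('Car/*'))

print(with_text_filter)
print(all_nodes)
print(elements_only)
```

The first expression omits `Model`, which is why the dictionary lookup raises a KeyError; the last one keeps it, with `.text` equal to `None`.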

That said -- if your immediate goal is to iterate over the nodes in the tree, you might consider using lxml.etree.iterparse() instead, which will avoid trying to build a full DOM tree in memory and otherwise be much more efficient than building a tree and then iterating over it with XPath. (Think SAX, but without the insane and painful API).

Implementing with iterparse could look like this:

    import io
    import lxml.etree

    def get_cars(infile):
        in_car = False
        current_car = {}
        for event, element in lxml.etree.iterparse(infile, events=('start', 'end')):
            if event == 'start':
                if element.tag == 'Car':
                    # Entering a <Car>: start collecting its fields.
                    in_car = True
                    current_car = {}
                continue
            if not in_car:
                continue
            if element.tag == 'Car':
                # Leaving the <Car>: stop collecting and emit the record.
                in_car = False
                yield current_car
                element.clear()  # free the subtree we no longer need
                continue
            current_car[element.tag] = element.text

    xml = b'<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>'
    for car in get_cars(io.BytesIO(xml)):
        print(car)

...it's more code, but (if we weren't using an in-memory buffer for the example) it could process a file much larger than could fit in memory.
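A more compact variant of the same streaming idea is sketched below. It relies on the `tag` argument of `lxml.etree.iterparse()` to report only `Car` elements, and builds each record once the closing tag has been seen. The function name `iter_cars`, the two-car sample data, and the choice to map an empty tag to `''` instead of `None` are assumptions made for illustration, not part of the original answer.

```python
import io
import lxml.etree


def iter_cars(source):
    """Yield one dict per <Car> element without keeping the whole tree."""
    # tag='Car' restricts the reported 'end' events to <Car> elements.
    for _event, car in lxml.etree.iterparse(source, events=('end',), tag='Car'):
        # Assumption: represent a blank tag such as <Model/> as an empty string.
        record = {child.tag: child.text or '' for child in car}
        yield record
        # Drop the children of the processed element to keep memory use low.
        car.clear()


sample = (
    b'<root>'
    b'<Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car>'
    b'<Car><Color>Red</Color><Make>Ford</Make><Model>Mustang</Model></Car>'
    b'</root>'
)

cars = list(iter_cars(io.BytesIO(sample)))
for car in cars:
    print(car)
```

For a real file, a filename or an open binary file object would be passed in place of the `io.BytesIO` buffer.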
