Python ElementTree - Extract text from an element, stripping tags

2022-01-10 00:00:00 python elementtree xml-parsing

Question

With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?

For example, say I have the following:

<tag> Some <a>example</a> text </tag>

I want to return "Some example text". How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.

Solution

If you are running Python 3.2+ (or Python 2.7, where the method was also added), you can use itertext().

itertext() creates a text iterator that loops over this element and all of its subelements in document order and returns all inner text, including the tail text that follows each child element:

    import xml.etree.ElementTree as ET

    xml = '<tag>Some <a>example</a> text</tag>'
    tree = ET.fromstring(xml)
    print(''.join(tree.itertext()))
    # -> 'Some example text'
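
Note that the element in the question has extra spaces just inside the tags, which itertext() preserves. As a minimal sketch, assuming that collapsing runs of whitespace is acceptable for your data, a small helper (the name text_content is hypothetical, not part of ElementTree) could normalize the result:

```python
import xml.etree.ElementTree as ET


def text_content(elem):
    """Return all text inside elem with runs of whitespace collapsed."""
    raw = ''.join(elem.itertext())
    # str.split() with no arguments splits on any whitespace and drops
    # leading and trailing whitespace, so joining with a single space
    # normalizes the string.
    return ' '.join(raw.split())


root = ET.fromstring('<tag> Some <a>example</a> text </tag>')
print(repr(''.join(root.itertext())))  # ' Some example text '
print(text_content(root))              # Some example text
```

This also collapses newlines and indentation, so it is not suitable where whitespace is significant (for example, preformatted text).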

If you are running an older version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly as above:

    import xml.etree.ElementTree as ET

    # original implementation of .itertext() for Python 2.7
    def itertext(self):
        tag = self.tag
        # comments and processing instructions have a non-string tag
        if not isinstance(tag, basestring) and tag is not None:
            return
        if self.text:
            yield self.text
        for e in self:
            for s in e.itertext():
                yield s
            if e.tail:
                yield e.tail

    # if necessary, monkey-patch the Element class
    if 'itertext' not in ET.Element.__dict__:
        ET.Element.itertext = itertext

    xml = '<tag>Some <a>example</a> text</tag>'
    tree = ET.fromstring(xml)
    print(''.join(tree.itertext()))
    # -> 'Some example text'
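
If you would rather not monkey-patch the Element class, the same traversal can be written as a standalone generator. The following is a sketch in Python 3 syntax (the name iter_text is hypothetical), intended to mirror the logic above rather than to replace the built-in method:

```python
import xml.etree.ElementTree as ET


def iter_text(elem):
    """Yield the text inside elem in document order (hypothetical helper)."""
    # Comments and processing instructions have a non-string tag; skip them.
    if not isinstance(elem.tag, str) and elem.tag is not None:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        # Text inside the child comes first, then the text that follows
        # the child's closing tag (its "tail").
        yield from iter_text(child)
        if child.tail:
            yield child.tail


root = ET.fromstring('<tag>Some <a>example</a> text</tag>')
print(''.join(iter_text(root)))  # Some example text
```

The key point is that text following a child element is stored on that child as tail, not on the parent, which is why reading only elem.text returns just "Some " for this example.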
