Writing with lxml emits no whitespace even when pretty_print=True

2022-01-10 00:00:00 python xml xml-parsing lxml

Problem description

I'm using the lxml library to read an XML template, insert or change some elements, and save the resulting XML. I create one of the elements on the fly using the etree.Element and etree.SubElement methods:

    tree = etree.parse(r'xml_archive\templates\metadata_template_pts.xml')
    root = tree.getroot()

    stream = []
    for element in root.iter():
        if isinstance(element.tag, basestring):
            stream.append(element.tag)

            # Find "keywords" element and insert a new "theme" element
            if element.tag == 'keywords' and 'theme' not in stream:
                theme = etree.Element('theme')
                themekt = etree.SubElement(theme, 'themekt').text = 'None'
                for tk in themekeys:
                    themekey = etree.SubElement(theme, 'themekey').text = tk
                element.insert(0, theme)

This prints nicely to the screen with print etree.tostring(theme, pretty_print=True):

    <theme>
      <themekt>None</themekt>
      <themekey>Hydrogeology</themekey>
      <themekey>Stratigraphy</themekey>
      <themekey>Floridan aquifer system</themekey>
      <themekey>Geology</themekey>
      <themekey>Regional Groundwater Availability Study</themekey>
      <themekey>USGS</themekey>
      <themekey>United States Geological Survey</themekey>
      <themekey>thickness</themekey>
      <themekey>altitude</themekey>
      <themekey>extent</themekey>
      <themekey>regions</themekey>
      <themekey>upper confining unit</themekey>
      <themekey>FAS</themekey>
      <themekey>base</themekey>
      <themekey>geologic units</themekey>
      <themekey>geology</themekey>
      <themekey>extent</themekey>
      <themekey>inlandWaters</themekey>
    </theme>

However, when I write out the XML with etree.ElementTree(root).write(out_xml_file, method='xml', pretty_print=True), this element is flattened in the output file:

<theme><themekt>None</themekt><themekey>Hydrogeology</themekey><themekey>Stratigraphy</themekey><themekey>Floridan aquifer system</themekey><themekey>Geology</themekey><themekey>Regional Groundwater Availability Study</themekey><themekey>USGS</themekey><themekey>United States Geological Survey</themekey><themekey>thickness</themekey><themekey>altitude</themekey><themekey>extent</themekey><themekey>regions</themekey><themekey>upper confining unit</themekey><themekey>FAS</themekey><themekey>base</themekey><themekey>geologic units</themekey><themekey>geology</themekey><themekey>extent</themekey><themekey>inlandWaters</themekey></theme>

The rest of the file is written nicely, but this particular element is causing (purely aesthetic) trouble. Any ideas what I'm doing wrong?

Below is a snippet of markup from the template XML file (save it as "template.xml" to run it with the code snippet at the bottom). The tags are flattened only when I parse an existing file and insert a new element, not when the XML is created from scratch with lxml.

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="fgdc_classic.xsl"?>
    <metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://water.usgs.gov/GIS/metadata/usgswrd/fgdc-std-001-1998.xsd">
        <keywords>
            <theme>
                <themekt>ISO 19115 Topic Categories</themekt>
                <themekey>environment</themekey>
                <themekey>geoscientificInformation</themekey>
                <themekey>inlandWaters</themekey>
            </theme>
            <place>
                <placekt>None</placekt>
                <placekey>Florida</placekey>
                <placekey>Georgia</placekey>
                <placekey>Alabama</placekey>
                <placekey>South Carolina</placekey>
            </place>
        </keywords>
    </metadata>

Below is a snippet of code to be used with the markup above:

    # Python 2 code (note basestring); on Python 3, use str instead
    from lxml import etree

    # Create new theme element to insert into root
    themekeys = ['Hydrogeology', 'Stratigraphy', 'inlandWaters']

    tree = etree.parse(r'template.xml')
    root = tree.getroot()

    stream = []
    for element in root.iter():
        if isinstance(element.tag, basestring):
            stream.append(element.tag)

            # Edit theme keywords
            if element.tag == 'keywords':
                theme = etree.Element('theme')
                themekt = etree.SubElement(theme, 'themekt').text = 'None'
                for tk in themekeys:
                    themekey = etree.SubElement(theme, 'themekey').text = tk
                element.insert(0, theme)

    # Write XML to new file
    out_xml_file = 'test.xml'
    etree.ElementTree(root).write(out_xml_file, method='xml', pretty_print=True)

    # Re-open the file and prepend the XML declaration
    with open(out_xml_file, 'r') as f:
        lines = f.readlines()
    with open(out_xml_file, 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        for line in lines:
            f.write(line)

Solution

If you replace this line:

    tree = etree.parse(r'template.xml')

with these lines:

    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(r'template.xml', parser)

then it will work as expected. The trick is to use an XMLParser that has the remove_blank_text option set to True. Any existing ignorable whitespace is removed at parse time and therefore cannot disrupt the subsequent pretty-printing. The serializer only adds indentation to elements that contain no text, and the whitespace kept from the template counts as text inside the keywords element, which is why the newly inserted element was left on one line.
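To see the difference in isolation, here is a minimal sketch that parses the same markup with and without remove_blank_text and then inserts a new element. It assumes Python 3 and an installed lxml. The TEMPLATE string is a hypothetical, shortened stand-in for template.xml, and insert_theme is an illustrative helper name, not part of the original post.

```python
from lxml import etree

# Hypothetical stand-in for template.xml, reduced to a few elements.
# The indentation here is the "ignorable whitespace" the answer refers to.
TEMPLATE = b"""<metadata>
  <keywords>
    <place>
      <placekt>None</placekt>
      <placekey>Florida</placekey>
    </place>
  </keywords>
</metadata>
"""


def insert_theme(parser=None):
    """Parse TEMPLATE, insert a new <theme> into <keywords>, and serialize."""
    root = etree.fromstring(TEMPLATE, parser)
    keywords = root.find('keywords')

    theme = etree.Element('theme')
    etree.SubElement(theme, 'themekt').text = 'None'
    for key in ('Hydrogeology', 'Stratigraphy'):
        etree.SubElement(theme, 'themekey').text = key
    keywords.insert(0, theme)

    return etree.tostring(root, pretty_print=True).decode()


# Default parser: whitespace text nodes from the template are kept.
default_output = insert_theme()

# Parser that drops whitespace-only text nodes while parsing.
stripped_output = insert_theme(etree.XMLParser(remove_blank_text=True))

print(default_output)
print(stripped_output)
```

With the default parser, the new theme element should come out on a single line, as in the question. With remove_blank_text=True, the whole tree should be re-indented consistently, including the inserted element.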
