iterparse fails to parse a field, while other similar ones are parsed fine

2022-01-10 00:00:00 python xml xml-parsing iterparse

Problem description

I use Python's iterparse to parse the XML result of a nessus scan (.nessus file). The parsing fails on unexpected records, while similar ones have been parsed correctly.

The general structure of the XML file is a lot of records like the one below:

    <ReportHost>
      <ReportItem>
        <foo>9.3</foo>
        <bar>hello</bar>
      </ReportItem>
      <ReportItem>
        <foo>10.0</foo>
        <bar>world</bar>
      </ReportItem>
    </ReportHost>
    <ReportHost>
      ...
    </ReportHost>

In other words, there are a lot of hosts (ReportHost), each with a lot of items to report (ReportItem), and each item has several characteristics (foo, bar). I want to generate one line per item, with its characteristics.

The parsing fails in the middle of the file at a simple line (foo in that case being cvss_base_score):

<cvss_base_score>9.3</cvss_base_score>

while ~200 similar lines have been parsed without problems.

The relevant piece of code is below. It sets context markers (inReportHost and inReportItem, which tell me where in the structure of the XML file I am) and either assigns or prints a value, depending on the context:

    import xml.etree.cElementTree as ET

    inReportHost = False
    inReportItem = False

    for event, elem in ET.iterparse("test2.nessus", events=("start", "end")):
        if event == 'start' and elem.tag == "ReportHost":
            inReportHost = True
        if event == 'end' and elem.tag == "ReportHost":
            inReportHost = False
            elem.clear()
        if inReportHost:
            if event == 'start' and elem.tag == 'ReportItem':
                inReportItem = True
                cvss = ''
            if event == 'start' and inReportItem:
                if event == 'start' and elem.tag == 'cvss_base_score':
                    cvss = elem.text
            if event == 'end' and elem.tag == 'ReportItem':
                print cvss
                inReportItem = False

cvss sometimes has the value None (after the cvss = elem.text assignment), even though identical entries have been parsed properly earlier in the file.

If I add, below the assignment, something along the lines of

if cvss is None: cvss = "0"

then many of the later cvss values are assigned correctly (while some others are still None).

When I take the <ReportHost>...</ReportHost> block that is parsed incorrectly and run it through the program on its own, it works fine (i.e. cvss is assigned 9.3 as expected).

I cannot see where the mistake in my code is: within a large set of similar records, some are processed correctly and some are not (some of the records are identical and are still processed differently). I also cannot find anything particular about the records that fail; identical ones earlier and later in the file are fine.

Solution

From the iterparse() documentation:

Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.
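As a minimal sketch of what the note implies, the snippet below reads elem.text only on "end" events, where the element is fully populated. The in-memory sample document and the helper name texts_at_end are made up for illustration; they are not part of the original question or of any nessus tooling.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like the records in the question.
SAMPLE = b"""<Report>
  <ReportHost>
    <ReportItem><cvss_base_score>9.3</cvss_base_score></ReportItem>
    <ReportItem><cvss_base_score>10.0</cvss_base_score></ReportItem>
  </ReportHost>
</Report>"""


def texts_at_end(source, tag):
    """Collect elem.text for every `tag` element, reading it only on "end" events."""
    values = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        # On "start" the text may not have been read yet, so those events are ignored.
        if event == "end" and elem.tag == tag:
            values.append(elem.text)
    return values


scores = texts_at_end(io.BytesIO(SAMPLE), "cvss_base_score")
print(scores)
```

This probably also explains why identical records behave differently: the parser consumes the file in chunks, so whether the text is already available at a "start" event likely depends on where the element happens to fall relative to a chunk boundary. That would be consistent with the failing ReportHost parsing correctly when it is extracted into its own file.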

Drop the inReport* variables and process each ReportHost only on its "end" event, when it has been fully parsed. Use the ElementTree API to get the necessary information, such as cvss_base_score, from the current ReportHost element.
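One way this could look for the "one line per item" goal from the question is sketched below. The sample data, the name attribute on ReportHost, and the helper item_scores are assumptions made for illustration; adapt the tags and attributes to whatever the real file contains.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the "name" attribute is an assumption for this sketch.
SAMPLE = b"""<Report>
  <ReportHost name="host-a">
    <ReportItem><cvss_base_score>9.3</cvss_base_score></ReportItem>
    <ReportItem><description>no score here</description></ReportItem>
  </ReportHost>
  <ReportHost name="host-b">
    <ReportItem><cvss_base_score>10.0</cvss_base_score></ReportItem>
  </ReportHost>
</Report>"""


def item_scores(source):
    """Yield (host name, cvss score) for every ReportItem, one host at a time."""
    # Only "end" events are requested, so every element seen here is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "ReportHost":
            continue
        host_name = elem.get("name", "")
        for item in elem.iter("ReportItem"):
            # findtext returns the default when the item has no cvss_base_score child.
            yield host_name, item.findtext("cvss_base_score", default="")
        elem.clear()  # drop the host's children once they have been reported


rows = list(item_scores(io.BytesIO(SAMPLE)))
for host_name, score in rows:
    print(host_name, score)
```

Using findtext with a default means an item without a score yields an empty string rather than None, so no separate None check is needed.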

To keep memory usage low, clear the root element after each host has been processed:

    # xml.etree.cElementTree was removed in Python 3.9; ElementTree is the same API.
    import xml.etree.ElementTree as etree

    def getelements(filename_or_file, tag):
        context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
        _, root = next(context)  # get root element
        for event, elem in context:
            if event == 'end' and elem.tag == tag:
                yield elem
                root.clear()  # preserve memory

    for host in getelements("test2.nessus", "ReportHost"):
        for cvss_el in host.iter("cvss_base_score"):
            print(cvss_el.text)
