How to parse invalid (bad / not well-formed) XML?

2022-01-10 00:00:00 xml xml-parsing java xml-validation

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.

<xml> ... <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description> ... </xml>

As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Recommended answer

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
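To see the distinction concretely, here is a minimal Java sketch that feeds the sample from the question to the JDK's DocumentBuilder. The class and method names (WellFormednessCheck, isWellFormed) are made up for illustration. The parser reports a fatal well-formedness error, not a validation error, so switching validation on or off does not change the outcome.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class WellFormednessCheck {

    /** Returns true if the text parses as XML, false on a fatal parse error. */
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // Mismatched tags, stray '<', illegal characters, and so on.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
            + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(isWellFormed(bad));
    }
}
```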

Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)

Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):

xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null

Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.

Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.

Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.

.NET:

XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.

@jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.

@jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the warning about rule breaking under "Process the data as text" below.

Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".

PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.

Ruby: Nokogiri supports "Gentle Well-Formedness".

R: See htmlTreeParse() for fault-tolerant markup parsing in R.

Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
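For the Java FilterInputStream route mentioned in the list above, a rough sketch might look like the following. It drops the ASCII control bytes that XML 1.0 forbids before the stream reaches the parser. The class name and the clean() helper are hypothetical, and the byte-level filtering assumes a UTF-8 (or other ASCII-compatible) encoding, where those byte values never occur inside a multi-byte sequence.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical preprocessing filter; assumes UTF-8 input.
final class XmlControlByteFilter extends FilterInputStream {

    XmlControlByteFilter(InputStream in) {
        super(in);
    }

    // True for 0x00-0x1F except tab, line feed, and carriage return.
    private static boolean isIllegal(int b) {
        return b >= 0 && b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (isIllegal(b)); // -1 (end of stream) is not "illegal"
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n;
            }
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (!isIllegal(b)) {
                    buf[off + kept++] = b; // compact the legal bytes in place
                }
            }
            if (kept > 0) {
                return kept;
            }
            // Every byte in this chunk was dropped; read the next chunk.
        }
    }

    /** Convenience wrapper for trying the filter on an in-memory string. */
    static String clean(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new XmlControlByteFilter(new ByteArrayInputStream(raw))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In real use the wrapped stream would be handed straight to DocumentBuilder.parse(InputStream). Note that this only addresses illegal control characters; it does nothing for stray tags like the one in the question.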

Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible as what appears to be predictable often is not -- rule breaking is rarely bound by rules.
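For the specific case in the question, a text-level fix could be sketched in Java as below. It rests on assumptions taken from the question, not on anything guaranteed: description is a leaf element with no attributes, and the literal text </description> never appears inside its content. The class and method names are made up. If either assumption fails on real customer data, this will silently produce wrong output, which is exactly the risk described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes markup characters inside <description>...</description>.
final class LeafEscaper {

    // Non-greedy body match; DOTALL so descriptions may span lines.
    private static final Pattern DESCRIPTION =
        Pattern.compile("(<description>)(.*?)(</description>)", Pattern.DOTALL);

    static String escapeDescription(String raw) {
        Matcher m = DESCRIPTION.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only angle brackets are escaped; ampersands are left alone
            // because they may already be part of entity references.
            String body = m.group(2).replace("<", "&lt;").replace(">", "&gt;");
            String fixed = m.group(1) + body + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
            + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(escapeDescription(bad));
    }
}
```

The escaped string can then be passed to DocumentBuilder.parse via an InputSource wrapping a StringReader.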

For invalid character errors, use regex to remove/replace invalid characters:

PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
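Since the question is about Java, a comparable sketch there could look like this. It is not from the original answer: the class name is hypothetical, the character class follows the XML 1.0 Char production (including the supplementary planes, which Java's regex engine matches as whole code points), and removing the characters instead of replacing them with a space is an arbitrary choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips characters that XML 1.0 does not allow.
final class XmlCharCleaner {

    // Anything outside #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF.
    private static final Pattern ILLEGAL_XML_10 = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    static String stripIllegal(String s) {
        return ILLEGAL_XML_10.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "ok" + (char) 0x00 + "text" + (char) 0x0B;
        System.out.println(stripIllegal(dirty));
    }
}
```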

For ampersands, use regex to replace matches with &amp;amp; (credit: blhsin):

&(?!(?:#\d+|#x[0-9a-f]+|\w+);)

Note that the above regular expressions won't take comments or CDATA sections into account.
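Applied in Java, the ampersand pattern might be used as in the sketch below. The class name is made up, and the hex digit class is widened to accept upper-case digits, which is a small departure from the pattern as written above. Anything that looks like a named reference (for example &bogus;) is left untouched even if no such entity is defined, so the parser can still reject the result.

```java
import java.util.regex.Pattern;

// Hypothetical helper: escapes ampersands that do not start a reference.
final class AmpersandEscaper {

    // '&' not followed by a decimal, hex, or named reference ending in ';'.
    private static final Pattern BARE_AMPERSAND =
        Pattern.compile("&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    static String escape(String s) {
        // '&' has no special meaning in a replacement string, so no quoting is needed.
        return BARE_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("<name>AT&T &amp; R&D</name>"));
    }
}
```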
