How to tell lxml.etree.tostring(element) not to write namespaces in Python?
Problem description
I have a huge XML file (1 GB). I want to move some of its elements (entries) to another file with the same header and specifications.
Let's say the original file contains this entry with the tag <to_move>:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE some SYSTEM "some.dtd">
<some>
...
<to_move date="somedate">
<child>some text</child>
...
...
</to_move>
...
</some>
I use lxml.etree.iterparse to iterate through the file, which works fine. When I find an element with the tag <to_move> (let's assume it is stored in the variable element), I do:
new_file.write(etree.tostring(element))
But this results in:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE some SYSTEM "some.dtd">
<some>
...
<to_move xmlns:="some" date="somedate"> # <---- Here is the problem. I don't want the namespace.
<child>some text</child>
...
...
</to_move>
...
</some>
So the question is: how do I tell etree.tostring() not to write the xmlns:="some" declaration? Is this possible? I struggled with the API documentation of lxml.etree, but I couldn't find a satisfying answer.
This is what I found for etree.tostring:
tostring(element_or_tree, encoding=None, method="xml",
xml_declaration=None, pretty_print=False, with_tail=True,
standalone=None, doctype=None, exclusive=False, with_comments=True)
Serialize an element to an encoded string representation of its XML tree.
None of the parameters of tostring() seems to help. Any suggestions or corrections?
I often grab a namespace to make an alias for it like this:
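import lxml.etree

someXML = lxml.etree.XML(someString)  # someString holds the XML document text
ns = None  # assumption: no prefix mapping has been set up earlier
if ns is None:
    # The root tag has the form "{namespace-uri}localname"; alias the URI as "m".
    ns = {"m": someXML.tag.split("}")[0][1:]}
someid = someXML.xpath('.//m:ImportantThing//m:ID', namespaces=ns)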
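The split works because lxml reports a namespaced tag in the form {namespace-uri}localname. As a minimal sketch, the same extraction can be written as a standalone helper that also copes with tags that have no namespace. The name namespace_of is hypothetical and used only for illustration; it is not part of lxml.

```python
def namespace_of(tag):
    """Return the namespace URI from a '{uri}local' tag, or None if there is none."""
    if tag.startswith("{"):
        # Drop the leading "{" and keep everything before the closing "}".
        return tag[1:].split("}", 1)[0]
    return None


print(namespace_of("{http://example.com/ns}Root"))
print(namespace_of("Root"))
```

Returning None for an unqualified tag avoids building a prefix mapping with an empty or wrong URI.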
You could do something similar to grab the namespace and then build a regex that strips its declaration from the output of tostring.
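A rough sketch of that regex cleanup, working on plain strings only. It assumes the output of tostring has already been decoded to str and that the namespace URI is known, for example from the tag as above. The function name strip_namespace_declarations is made up for this example. The pattern also accepts an empty prefix so that it matches the xmlns:="some" form shown in the question.

```python
import re


def strip_namespace_declarations(xml_text, namespace_uri):
    """Remove xmlns="uri" and xmlns:prefix="uri" declarations for one URI."""
    pattern = re.compile(
        r"""\s+xmlns(?::[\w.-]*)?=(["'])"""  # leading space, optional prefix, opening quote
        + re.escape(namespace_uri)           # the URI, matched literally
        + r"\1"                              # the same quote character closes the value
    )
    return pattern.sub("", xml_text)


serialized = '<to_move xmlns="some" date="somedate"><child>some text</child></to_move>'
print(strip_namespace_declarations(serialized, "some"))
```

This only removes the declarations. If elements or attributes in the output carry a prefix bound to that namespace, the prefixes stay behind and the result is no longer namespace-well-formed, so the approach fits best when only a default namespace is involved.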
Or you could clean up the input string by hand. Find the first space and check whether it is followed by xmlns. If it is, delete the whole xmlns declaration up to the next space; if it is not, move on to the next space. Repeat until there are no more spaces or xmlns declarations, but don't go past the first >.
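The manual scan can be approximated as follows. This sketch swaps the space-by-space loop for a regular expression that is applied only to the text before the first >, so nested elements are left untouched. It assumes the string starts with the element's own start tag (no XML declaration or DOCTYPE in front) and that no attribute value in that tag contains a > character. The name clean_first_tag is hypothetical.

```python
import re

# One namespace declaration: leading whitespace, xmlns or xmlns:prefix, quoted value.
XMLNS_ATTR = re.compile(r"""\s+xmlns(?::[\w.-]*)?=(?:"[^"]*"|'[^']*')""")


def clean_first_tag(xml_text):
    """Strip xmlns declarations from the first start tag only."""
    end = xml_text.find(">")
    if end == -1:
        return xml_text  # no tag end found, leave the text unchanged
    head, rest = xml_text[:end], xml_text[end:]
    return XMLNS_ATTR.sub("", head) + rest


text = '<to_move xmlns="some" date="somedate"><child xmlns="keep">x</child></to_move>'
print(clean_first_tag(text))
```

Limiting the substitution to the first start tag is what keeps declarations on child elements, such as the one on child in the sample input, from being removed.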