XML to CSV in Python
Problem description
I'm having a lot of trouble converting an XML file to a CSV in Python. I've looked at many forums and tried both lxml and xmlutils.xml2csv, but I can't get either to work. The file contains GPS data from a Garmin GPS device.
Here's what my XML file looks like, shortened of course:
<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
<trk>
<name>2013-12-03T21:08:56Z</name>
<trkseg>
<trkpt lat="45.4852855" lon="-122.6347885">
<ele>0.0000000</ele>
<time>2013-12-03T21:08:56Z</time>
</trkpt>
<trkpt lat="45.4852961" lon="-122.6347926">
<ele>0.0000000</ele>
<time>2013-12-03T21:09:00Z</time>
</trkpt>
<trkpt lat="45.4852982" lon="-122.6347897">
<ele>0.2000000</ele>
<time>2013-12-03T21:09:01Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
There are several trk tags in my massive XML file, but I can manage to separate them out; they represent different "segments" or trips on the GPS device. All I want is a CSV file that looks something like this:
LAT LON TIME ELE
45.4... -122.6... 2013-12... 0.00...
... ... ... ...
Here's the code I have so far:
## Call libraries
import csv
from xmlutils.xml2csv import xml2csv
inputs = "myfile.xml"
output = "myfile.csv"
converter = xml2csv(inputs, output)
converter.convert(tag="WHATEVER_GOES_HERE_RENDERS_EMPTY_CSV")
Here is an alternative attempt. It only outputs a CSV file with no data, just the headers lat and lon.
import csv
import lxml.etree
x = b'''<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="TC2 to GPX11 XSLT stylesheet" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd">
<trk>
<name>2013-12-03T21:08:56Z</name>
<trkseg>
<trkpt lat="45.4852855" lon="-122.6347885">
<ele>0.0000000</ele>
<time>2013-12-03T21:08:56Z</time>
</trkpt>
<trkpt lat="45.4852961" lon="-122.6347926">
<ele>0.0000000</ele>
<time>2013-12-03T21:09:00Z</time>
</trkpt>
<trkpt lat="45.4852982" lon="-122.6347897">
<ele>0.2000000</ele>
<time>2013-12-03T21:09:01Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
'''
with open('output.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('lat', 'lon'))
    root = lxml.etree.fromstring(x)
    for trkpt in root.iter('trkpt'):
        row = trkpt.get('lat'), trkpt.get('lon')
        writer.writerow(row)
How do I do this? Please keep in mind that I'm a novice, so a more comprehensive explanation would be super awesome!
Solution
This is a namespaced XML document. Therefore you need to address the nodes using their respective namespaces.
The namespaces used in the document are defined at the top:
xmlns:tc2="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
xmlns="http://www.topografix.com/GPX/1/1"
So the first namespace is mapped to the short form tc2, and would be used in an element like <tc2:foobar/>. The last one, which has no short form after xmlns, is called the default namespace. It applies to all elements in the document that don't carry an explicit prefix, so it applies to your <trkpt/> elements as well.
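To see the effect of the default namespace, it can help to print the tag names the parser actually reports. The following is a minimal sketch, assuming the standard library's xml.etree.ElementTree (lxml.etree exposes the same tag and iter() interface); the GPX string is a trimmed-down stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the GPX file (an assumption for illustration)
GPX = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" '
    'xmlns:tp1="http://www.garmin.com/xmlschemas/TrackPointExtension/v1">'
    '<trk><trkseg><trkpt lat="45.4852855" lon="-122.6347885">'
    '<ele>0.0000000</ele></trkpt></trkseg></trk></gpx>'
)

root = ET.fromstring(GPX)

# Each tag is reported as "{namespace-uri}localname", even though
# the source text only says <trkpt>, <ele>, and so on.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

Every printed tag starts with {http://www.topografix.com/GPX/1/1}, which is why a search for the bare name trkpt comes back empty.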
Therefore you would need to write root.iter('{http://www.topografix.com/GPX/1/1}trkpt') to select these elements.
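The difference between the bare and the qualified tag name can be checked side by side. This is a small sketch, again assuming xml.etree.ElementTree from the standard library as a stand-in for lxml.etree; SAMPLE is a made-up two-point excerpt of the file.

```python
import xml.etree.ElementTree as ET

NS = 'http://www.topografix.com/GPX/1/1'

# Made-up two-point excerpt of the GPX file (assumption for illustration)
SAMPLE = '''<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885"/>
    <trkpt lat="45.4852961" lon="-122.6347926"/>
  </trkseg></trk>
</gpx>'''

root = ET.fromstring(SAMPLE)

# The bare name matches nothing because no element is literally named "trkpt"
bare = list(root.iter('trkpt'))

# The namespace-qualified name matches both track points
qualified = list(root.iter('{%s}trkpt' % NS))

print(len(bare), len(qualified))
```

The empty result for the bare name is exactly what produced the header-only CSV in the question.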
In order to also get time and elevation, you can use trkpt.find() to access these elements below the trkpt node, and then element.text to retrieve those elements' text content (as opposed to attributes like lat and lon, which are read with .get()). Because the time and ele elements also use the default namespace, you'll have to use the {namespace}element syntax again to select those nodes.
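As an alternative to spelling out {namespace}element each time, find() and findtext() also accept a prefix-to-URI mapping. The sketch below assumes xml.etree.ElementTree (lxml.etree accepts the same namespaces argument); the gpx prefix and the hr element are arbitrary names picked for this example, not something defined in the file.

```python
import xml.etree.ElementTree as ET

# 'gpx' is a prefix chosen for this example; only the URI has to match the file
NSMAP = {'gpx': 'http://www.topografix.com/GPX/1/1'}

SAMPLE = '''<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:08:56Z</time>
    </trkpt>
  </trkseg></trk>
</gpx>'''

root = ET.fromstring(SAMPLE)
trkpt = root.find('.//gpx:trkpt', NSMAP)

# Attributes come from .get(), child element text from .find(...).text
lat = trkpt.get('lat')
ele = trkpt.find('gpx:ele', NSMAP).text

# findtext() returns the default instead of failing when a child is absent
time = trkpt.findtext('gpx:time', default='', namespaces=NSMAP)
missing = trkpt.findtext('gpx:hr', default='', namespaces=NSMAP)  # hypothetical element

print(lat, ele, time, repr(missing))
```

Using findtext() with a default avoids an AttributeError on track points that lack an ele or time child, which find(...).text would raise.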
So you could use something like this:
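import csv
import lxml.etree

NS = 'http://www.topografix.com/GPX/1/1'
header = ('lat', 'lon', 'ele', 'time')

# x holds the GPX document from the question
# newline='' lets the csv module control line endings (Python 3)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    root = lxml.etree.fromstring(x)
    # Every element is in the default namespace, so qualify each tag name
    for trkpt in root.iter('{%s}trkpt' % NS):
        lat = trkpt.get('lat')  # attributes are read with .get()
        lon = trkpt.get('lon')
        ele = trkpt.find('{%s}ele' % NS).text  # child text via .find().text
        time = trkpt.find('{%s}time' % NS).text
        row = lat, lon, ele, time
        writer.writerow(row)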
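Since the real file is described as massive, a streaming variant may be worth considering. The following is only a sketch under assumptions: it uses xml.etree.ElementTree.iterparse from the standard library rather than lxml, the function name gpx_to_csv is made up for this example, and the column order follows the LAT LON TIME ELE layout from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = 'http://www.topografix.com/GPX/1/1'


def gpx_to_csv(source, out):
    """Stream <trkpt> elements from `source` (path or binary file object)
    and write one CSV row per point to the text file object `out`."""
    writer = csv.writer(out)
    writer.writerow(('lat', 'lon', 'time', 'ele'))
    count = 0
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != '{%s}trkpt' % NS:
            continue
        writer.writerow((
            elem.get('lat'),
            elem.get('lon'),
            # an empty string is written when a point has no time or ele child
            elem.findtext('{%s}time' % NS, default=''),
            elem.findtext('{%s}ele' % NS, default=''),
        ))
        count += 1
        elem.clear()  # release the children of points already written
    return count


# Small in-memory demonstration with two tracks (made-up excerpt)
sample = b'''<?xml version="1.0" encoding="utf-8"?>
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="45.4852855" lon="-122.6347885">
      <ele>0.0000000</ele>
      <time>2013-12-03T21:08:56Z</time>
    </trkpt>
  </trkseg></trk>
  <trk><trkseg>
    <trkpt lat="45.4852961" lon="-122.6347926">
      <time>2013-12-03T21:09:00Z</time>
    </trkpt>
  </trkseg></trk>
</gpx>'''

buf = io.StringIO()
n = gpx_to_csv(io.BytesIO(sample), buf)
print(buf.getvalue())
```

To write to disk instead, pass a file opened with open('output.csv', 'w', newline='') as out and the path of the XML file as source. Points from all trk elements end up in the same CSV, so separating trips would need an extra column or separate files.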
For more information on XML namespaces, see the Namespaces section in the lxml tutorial and the Wikipedia article on XML namespaces. Also see GPS eXchange Format for some details on the .gpx format.