Parsing Large XML Files in Python
Problem Description
I have an XML file of size 4 GB. I want to parse it and convert it to a DataFrame to work on it. But because the file is too large, the following code is unable to convert it to a pandas DataFrame; it just keeps running and never produces any output. When I use it on a similar file of smaller size, I get the correct output.
Can anyone suggest a solution to this? Perhaps code that speeds up the conversion from XML to DataFrame, or a way to split the XML file into smaller subsets.
Any suggestion on whether I should work with such large XML files on my personal system (2 GB RAM) or use Google Colab? If Google Colab, is there any way to upload such large files to Drive, and thus to Colab, more quickly?
Following is the code I used:
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("Badges.xml")
root = tree.getroot()

# Column names for DataFrame
columns = ['row Id', "UserId", 'Name', 'Date', 'Class', 'TagBased']

# Creating DataFrame
df = pd.DataFrame(columns=columns)

# Converting XML tree to a pandas DataFrame
for node in root:
    row_Id = node.attrib.get("Id")
    UserId = node.attrib.get("UserId")
    Name = node.attrib.get("Name")
    Date = node.attrib.get("Date")
    Class = node.attrib.get("Class")
    TagBased = node.attrib.get("TagBased")
    df = df.append(pd.Series([row_Id, UserId, Name, Date, Class, TagBased], index=columns), ignore_index=True)
Following is my XML File:
<badges>
<row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
<row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
Solution
Consider iterparse for fast streaming processing that builds the tree incrementally. In each iteration, build up a list of dictionaries that you can then pass to the pandas.DataFrame constructor once, outside the loop. Adjust the code below to the tag name of the repeating child nodes under the root (here, row):
from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(elem.attrib)  # ALTERNATIVELY, PARSE ALL ATTRIBUTES

    elem.clear()                         # free memory of the processed element

df = pd.DataFrame(dict_list)
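As an optional follow-up (not part of the original answer): all attributes parsed this way arrive as strings, so you may want to convert the columns to proper dtypes before analysis. A minimal sketch, assuming the badges schema shown in the question:

# Post-processing sketch: convert string columns to proper dtypes
df['rowId'] = df['rowId'].astype('int64')
df['UserId'] = df['UserId'].astype('int64')
df['Date'] = pd.to_datetime(df['Date'])                               # ISO 8601 timestamps -> datetime64
df['Class'] = df['Class'].astype('int8')                              # badge class (1, 2, 3)
df['TagBased'] = df['TagBased'].map({'True': True, 'False': False})   # string flag -> bool

print(df.info())

For what it's worth, recent pandas versions (1.5 and later) also expose an iterparse argument on pandas.read_xml, which implements essentially the same streaming approach and may be worth trying if upgrading pandas is an option.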