在 Pandas 中解析多索引 Excel 文件
问题描述
我有一个包含三级列 MultiIndex 的时间序列 Excel 文件,如果可能的话,我想成功解析它.有一些关于如何为堆栈溢出索引而不是列执行此操作的结果,并且 parse
函数有一个 header
似乎没有获取行列表.
I have a time series excel file with a tri-level column MultiIndex that I would like to successfully parse if possible. There are some results on how to do this for an index on stack overflow but not the columns and the parse
function has a header
that does not seem to take a list of rows.
ExcelFile 如下所示:
The ExcelFile looks like is like the following:
- A 列是从 A4 开始的所有时间序列日期
- B 列有 top_level1 (B1) mid_level1 (B2) low_level1 (B3) 数据 (B4-B100+)
- C 列有 null (C1) null (C2) low_level2 (C3) 数据 (C4-C100+)
- D 列有空 (D1) mid_level2 (D2) low_level1 (D3) 数据 (D4-D100+)
- E 列有 null (E1) null (E2) low_level2 (E3) 数据 (E4-E100+)
- ...
所以有两个 low_level
值,许多 mid_level
值和几个 top_level
值,但诀窍是顶层和中级值是 null并假定为左边的值.因此,例如上面的所有列都将 top_level1 作为顶部多索引值.
So there are two low_level
values many mid_level
values and a few top_level
values but the trick is the top and mid level values are null and are assumed to be the values to the left. So, for instance all the columns above would have top_level1 as the top multi-index value.
到目前为止,我最好的想法是使用 transpose
,但它会在任何地方填充 Unnamed: #
并且似乎不起作用.在 Pandas 0.13 中,read_csv
似乎有一个可以获取列表的 header
参数,但这似乎不适用于 parse
.
My best idea so far is to use transpose
, but the it fills Unnamed: #
everywhere and doesn't seem to work. In Pandas 0.13 read_csv
seems to have a header
parameter that can take a list, but this doesn't seem to work with parse
.
解决方案
你可以fillna
空值.我没有你的文件,但你可以测试
You can fillna
the null values. I don't have your file, but you can test
#Headers as rows for now
df = pd.read_excel(xls_file,0, header=None, index_col=0)
#fill in Null values in "Headers"
df = df.fillna(method='ffill', axis=1)
#create multiindex column names
df.columns=pd.MultiIndex.from_arrays(df[:3].values, names=['top','mid','low'])
#Just name of index
df.index.name='Date'
#remove 3 rows which are already used as column names
df = df[pd.notnull(df.index)]
相关文章