Pandas MultiIndex 中的重采样

2022-01-11 00:00:00 python pandas multi-index time-series hierarchical-data

问题描述

我有一些分层数据，这些数据从底部变成时间序列数据，看起来像这样:

I have some hierarchical data which bottoms out into time series data which looks something like this:

df = pandas.DataFrame( {'value_a': values_a, 'value_b': values_b}, index=[states, cities, dates]) df.index.names = ['State', 'City', 'Date'] df value_a value_b State City Date Georgia Atlanta 2012-01-01 0 10 2012-01-02 1 11 2012-01-03 2 12 2012-01-04 3 13 Savanna 2012-01-01 4 14 2012-01-02 5 15 2012-01-03 6 16 2012-01-04 7 17 Alabama Mobile 2012-01-01 8 18 2012-01-02 9 19 2012-01-03 10 20 2012-01-04 11 21 Montgomery 2012-01-01 12 22 2012-01-02 13 23 2012-01-03 14 24 2012-01-04 15 25

我想对每个城市进行时间重采样，比如

I'd like to perform time resampling per city, so something like

df.resample("2D", how="sum")

会输出

value_a value_b State City Date Georgia Atlanta 2012-01-01 1 21 2012-01-03 5 25 Savanna 2012-01-01 9 29 2012-01-03 13 33 Alabama Mobile 2012-01-01 17 37 2012-01-03 21 41 Montgomery 2012-01-01 25 45 2012-01-03 29 49

按原样，df.resample('2D', how='sum') 让我明白了

TypeError: Only valid with DatetimeIndex or PeriodIndex

很公平，但我有点希望这能奏效:

Fair enough, but I'd sort of expect this to work:

>>> df.swaplevel('Date', 'State').resample('2D', how='sum') TypeError: Only valid with DatetimeIndex or PeriodIndex

在这一点上我真的没有想法了......有什么方法可以帮助我吗?

at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?

解决方案

pd.Grouper允许您指定目标对象的 groupby 指令".在特别是，即使 df.index 不是 DatetimeIndex，您也可以使用它按日期分组:

pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:

df.groupby(pd.Grouper(freq='2D', level=-1))

level=-1 告诉 pd.Grouper 在 MultiIndex 的最后一级查找日期.此外，您可以将其与索引中的其他级别值结合使用:

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:

level_values = df.index.get_level_values result = (df.groupby([level_values(i) for i in [0,1]] +[pd.Grouper(freq='2D', level=-1)]).sum())

看起来有点尴尬，但是 using_Grouper 结果比我原来的要快得多建议，using_reset_index:

It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:

import numpy as np import pandas as pd import datetime as DT def using_Grouper(df): level_values = df.index.get_level_values return (df.groupby([level_values(i) for i in [0,1]] +[pd.Grouper(freq='2D', level=-1)]).sum()) def using_reset_index(df): df = df.reset_index(level=[0, 1]) return df.groupby(['State','City']).resample('2D').sum() def using_stack(df): # http://stackoverflow.com/a/15813787/190597 return (df.unstack(level=[0,1]) .resample('2D').sum() .stack(level=[2,1]) .swaplevel(2,0)) def make_orig(): values_a = range(16) values_b = range(10, 26) states = ['Georgia']*8 + ['Alabama']*8 cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4 dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4) df = pd.DataFrame( {'value_a': values_a, 'value_b': values_b}, index = [states, cities, dates]) df.index.names = ['State', 'City', 'Date'] return df def make_df(N): dates = pd.date_range('2000-1-1', periods=N) states = np.arange(50) cities = np.arange(10) index = pd.MultiIndex.from_product([states, cities, dates], names=['State', 'City', 'Date']) df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index, columns=['value_a', 'value_b']) return df df = make_orig() print(using_Grouper(df))

产量

value_a value_b State City Date Alabama Mobile 2012-01-01 17 37 2012-01-03 21 41 Montgomery 2012-01-01 25 45 2012-01-03 29 49 Georgia Atlanta 2012-01-01 1 21 2012-01-03 5 25 Savanna 2012-01-01 9 29 2012-01-03 13 33

<小时>
这是在 5000 行 DataFrame 上比较 using_Grouper、using_reset_index、using_stack 的基准:

In [30]: df = make_df(10) In [34]: len(df) Out[34]: 5000 In [32]: %timeit using_Grouper(df) 100 loops, best of 3: 6.03 ms per loop In [33]: %timeit using_stack(df) 10 loops, best of 3: 22.3 ms per loop In [31]: %timeit using_reset_index(df) 1 loop, best of 3: 659 ms per loop

相关文章