Pandas: average value for the past n days

2022-01-11 00:00:00 python pandas time-series aggregation

Problem description

I have a Pandas data frame like this:

    import pandas as pd

    test = pd.DataFrame({
        'Date': ['2016-04-01', '2016-04-01', '2016-04-02',
                 '2016-04-02', '2016-04-03', '2016-04-04',
                 '2016-04-05', '2016-04-06', '2016-04-06'],
        'User': ['Mike', 'John', 'Mike', 'John', 'Mike', 'Mike',
                 'Mike', 'Mike', 'John'],
        'Value': [1, 2, 1, 3, 4.5, 1, 2, 3, 6],
    })

As you can see below, the data set does not necessarily have observations for every day:

             Date  User  Value
    0  2016-04-01  Mike    1.0
    1  2016-04-01  John    2.0
    2  2016-04-02  Mike    1.0
    3  2016-04-02  John    3.0
    4  2016-04-03  Mike    4.5
    5  2016-04-04  Mike    1.0
    6  2016-04-05  Mike    2.0
    7  2016-04-06  Mike    3.0
    8  2016-04-06  John    6.0

I'd like to add a new column that shows each user's average value over the past n days (in this case n = 2) if at least one of those days is available; otherwise it should be NaN. For example, on 2016-04-06 John gets NaN because he has no data for 2016-04-05 or 2016-04-04. So the result would look something like this:

             Date  User  Value  Value_Average_Past_2_days
    0  2016-04-01  Mike    1.0                        NaN
    1  2016-04-01  John    2.0                        NaN
    2  2016-04-02  Mike    1.0                       1.00
    3  2016-04-02  John    3.0                       2.00
    4  2016-04-03  Mike    4.5                       1.00
    5  2016-04-04  Mike    1.0                       2.75
    6  2016-04-05  Mike    2.0                       2.75
    7  2016-04-06  Mike    3.0                       1.50
    8  2016-04-06  John    6.0                        NaN

After reading several posts in the forum, it seems that I should use a combination of group_by and a customized rolling_mean, but I couldn't quite figure out how to do it.

Solution

I think you can first convert the Date column with to_datetime, then find the missing days by groupby with resample, and finally apply rolling:

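    test['Date'] = pd.to_datetime(test['Date'])
    # Reindex each user to a daily frequency so missing days appear as NaN rows
    df = test.groupby('User').apply(lambda x: x.set_index('Date').resample('1D').first())
    print(df)

                     User  Value
    User Date
    John 2016-04-01  John    2.0
         2016-04-02  John    3.0
         2016-04-03   NaN    NaN
         2016-04-04   NaN    NaN
         2016-04-05   NaN    NaN
         2016-04-06  John    6.0
    Mike 2016-04-01  Mike    1.0
         2016-04-02  Mike    1.0
         2016-04-03  Mike    4.5
         2016-04-04  Mike    1.0
         2016-04-05  Mike    2.0
         2016-04-06  Mike    3.0

    # shift() excludes the current day; the window then covers the 2 previous days.
    # group_keys=False keeps the (User, Date) index as-is, so reset_index does not
    # hit a duplicate 'User' level on recent pandas versions.
    df1 = (df.groupby(level=0, group_keys=False)['Value']
             .apply(lambda x: x.shift().rolling(min_periods=1, window=2).mean())
             .reset_index(name='Value_Average_Past_2_days'))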

    print(df1)

        User       Date  Value_Average_Past_2_days
    0   John 2016-04-01                        NaN
    1   John 2016-04-02                       2.00
    2   John 2016-04-03                       2.50
    3   John 2016-04-04                       3.00
    4   John 2016-04-05                        NaN
    5   John 2016-04-06                        NaN
    6   Mike 2016-04-01                        NaN
    7   Mike 2016-04-02                       1.00
    8   Mike 2016-04-03                       1.00
    9   Mike 2016-04-04                       2.75
    10  Mike 2016-04-05                       2.75
    11  Mike 2016-04-06                       1.50

    # Join the averages back onto the original rows only
    print(pd.merge(test, df1, on=['Date', 'User'], how='left'))

            Date  User  Value  Value_Average_Past_2_days
    0 2016-04-01  Mike    1.0                        NaN
    1 2016-04-01  John    2.0                        NaN
    2 2016-04-02  Mike    1.0                       1.00
    3 2016-04-02  John    3.0                       2.00
    4 2016-04-03  Mike    4.5                       1.00
    5 2016-04-04  Mike    1.0                       2.75
    6 2016-04-05  Mike    2.0                       2.75
    7 2016-04-06  Mike    3.0                       1.50
    8 2016-04-06  John    6.0                        NaN
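
As a possible alternative to resampling, a time-based rolling window may give the same result without materializing the missing days. The following is only a minimal sketch and not part of the answer above. It assumes at most one row per user per day, and the names n, past_mean and result are introduced here for illustration.

```python
import pandas as pd

test = pd.DataFrame({
    'Date': ['2016-04-01', '2016-04-01', '2016-04-02',
             '2016-04-02', '2016-04-03', '2016-04-04',
             '2016-04-05', '2016-04-06', '2016-04-06'],
    'User': ['Mike', 'John', 'Mike', 'John', 'Mike', 'Mike',
             'Mike', 'Mike', 'John'],
    'Value': [1, 2, 1, 3, 4.5, 1, 2, 3, 6],
})
test['Date'] = pd.to_datetime(test['Date'])

n = 2  # illustrative: number of past days to average over

# Offset-based window per user. closed='left' makes the window
# [Date - n days, Date), so the current day is excluded and a user
# with no rows in that interval gets NaN.
past_mean = (
    test.sort_values('Date')          # rolling on offsets needs sorted dates
        .set_index('Date')
        .groupby('User')['Value']
        .rolling(f'{n}D', closed='left')
        .mean()
        .rename(f'Value_Average_Past_{n}_days')
        .reset_index()
)

# Attach the averages to the original rows, keeping their order.
# Assumes (User, Date) pairs are unique; duplicates would multiply rows.
result = test.merge(past_mean, on=['User', 'Date'], how='left')
print(result)
```

Because the window is defined in calendar time rather than in row counts, gaps in a user's history do not need to be filled first; whether this is preferable to the resample approach depends on how sparse the data is.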
