pandas: resample to a specific weekday of each month

2022-01-11 00:00:00 python pandas time-series

Question

I have a pandas DataFrame that I'd like to resample to the third Friday of every month.

import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range("2018-01-01", "2018-08-31")
dates_df = pd.DataFrame(data=np.random.random(len(dates)), index=dates)

# requested output: keep only the third Friday of each month (a Friday on day 15..21)
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]
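As a quick sanity check on the mask: the third occurrence of any weekday always falls on day 15 through 21 of the month, so a Friday inside that window is the third Friday. The sketch below only rebuilds the same date range and lists the selected dates; the name `picked_days` is introduced here for illustration.

```python
import pandas as pd

dates = pd.date_range("2018-01-01", "2018-08-31")

# A Friday (weekday 4) whose day of month is 15..21 is the third Friday.
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)

picked_days = dates[mask].strftime("%Y-%m-%d").tolist()
print(picked_days)
# ['2018-01-19', '2018-02-16', '2018-03-16', '2018-04-20',
#  '2018-05-18', '2018-06-15', '2018-07-20', '2018-08-17']
```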

But when a third Friday is missing (e.g. after dropping February's third Friday), I want the latest available value (i.e. the value as of 2018-02-15). Using the mask gives me the next value instead (Feb 17 rather than Feb 15):

# remove February third Friday:
dates_df = dates_df.drop([pd.to_datetime("2018-02-16")])
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]
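Note that the mask above is still built from `dates` (243 entries), while `dates_df` now has 242 rows, so the two no longer line up row for row. A minimal sketch, assuming the goal is only to keep the mask aligned, builds it from the frame's own index instead; `idx`, `aligned_mask` and `picked` are names introduced here. With an aligned mask February simply drops out, which shows why a mask alone cannot supply the "latest value as of" a missing date.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range("2018-01-01", "2018-08-31")
dates_df = pd.DataFrame(data=np.random.random(len(dates)), index=dates)

# remove February's third Friday
dates_df = dates_df.drop([pd.to_datetime("2018-02-16")])

# Build the mask from the frame's own index so it has one entry per row.
idx = dates_df.index
aligned_mask = (idx.weekday == 4) & (14 < idx.day) & (idx.day < 22)
picked = dates_df.loc[aligned_mask]

print(picked.index.month.tolist())  # [1, 3, 4, 5, 6, 7, 8] -> no row for February
```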

Using a monthly resample in combination with loffset gives the end-of-month values with the index offset, which is also not what I want:

from pandas.tseries.offsets import WeekOfMonth
dates_df.resample("M", loffset=WeekOfMonth(week=2, weekday=4)).last()

Is there an alternative (preferably using resample) that avoids resampling to daily values first and then applying a mask? That approach takes a long time on my DataFrame.


Solution

If I understand correctly, your second attempt is heading in the right direction: you just need to resample using WeekOfMonth as the rule rather than as an offset:

dates_df.resample(WeekOfMonth(week=2, weekday=4)).asfreq().dropna()

This approach does not offset the index; it should return just the data for the third Friday of every month.
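To see what the WeekOfMonth rule stands for, here is a small sketch of the offset on its own. Note that `week` is zero-based (2 means the third occurrence) and `weekday=4` is Friday; the variable names `third_friday` and `labels` are introduced here for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import WeekOfMonth

# week is zero-based: week=2 -> third occurrence; weekday=4 -> Friday
third_friday = WeekOfMonth(week=2, weekday=4)

# Roll a date forward to the next third Friday (a third Friday stays put).
print(third_friday.rollforward(pd.Timestamp("2018-05-01")))  # 2018-05-18 00:00:00
print(third_friday.rollforward(pd.Timestamp("2018-05-18")))  # 2018-05-18 00:00:00

# Used as a frequency, the offset yields one label per month.
labels = pd.date_range("2018-05-01", "2018-08-31", freq=third_friday)
print(labels.strftime("%Y-%m-%d").tolist())
# ['2018-05-18', '2018-06-15', '2018-07-20', '2018-08-17']
```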

Dealing with a missing 3rd Friday:

With the above code, a month whose 3rd Friday is missing will be excluded entirely. Depending on how you want to handle missing data, you can fill the gap with bfill or ffill (pad is another name for ffill). For example, you can amend the above to the following:

dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()

The above fills a missing 3rd Friday with the next available value (bfill).

Update: let's work with a fixed data set instead of np.random:

# create a smaller daterange
dates = pd.date_range("2018-05-01", "2018-08-31")

# create a data with only 1,2,3 values
data = [1,2,3] * int(len(dates)/3)

dates_df = pd.DataFrame(data=data, index=dates)
dates_df.head()

# Output:

2018-05-01  1
2018-05-02  2
2018-05-03  3
2018-05-04  1
2018-05-05  2

Now let's check what the data looks like for the 3rd Friday of each month by selecting those dates manually:

dates_df.loc[[
    pd.Timestamp('2018-05-18'),
    pd.Timestamp('2018-06-15'),
    pd.Timestamp('2018-07-20'),
    pd.Timestamp('2018-08-17')
]]

# Output:

2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  1

If no 3rd Fridays are missing, running the code provided earlier:

dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).asfreq().dropna()

produces the following output:

2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  1

As you can see, the index has not been shifted here, and the result contains the exact values for the 3rd Friday of each month.

Now say some 3rd Fridays are missing. How you fill them depends on what you want (previous value: ffill; next value: bfill):

  • pad/ffill: propagate the last valid observation forward to fill the gap
  • backfill/bfill: use the NEXT valid observation to fill the gap

dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).ffill().asfreq(freq='D').dropna()

2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  3

dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).bfill().asfreq(freq='D').dropna()

2018-04-20  1
2018-05-18  3
2018-06-15  1
2018-07-20  3
2018-08-17  2
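As an alternative sketch that does not go through resample (so it is a swapped-in technique, not the approach above), the "latest value as of each third Friday" can also be obtained by generating the third Fridays with `pd.date_range` and reindexing with `method="ffill"`. This assumes the index is sorted; `with_gap`, `third_fridays` and `as_of` are names introduced here. Because the date range starts at the first observation, no extra 2018-04-20 label appears.

```python
import pandas as pd
from pandas.tseries.offsets import WeekOfMonth

dates = pd.date_range("2018-05-01", "2018-08-31")
dates_df = pd.DataFrame(data=[1, 2, 3] * (len(dates) // 3), index=dates)

# Drop one third Friday to simulate a gap.
with_gap = dates_df.drop(index=pd.Timestamp("2018-08-17"))

# All third Fridays between the first and last observation.
third_fridays = pd.date_range(
    with_gap.index.min(), with_gap.index.max(), freq=WeekOfMonth(week=2, weekday=4)
)

# For each third Friday, take the row at that date or the most recent earlier one.
as_of = with_gap.reindex(third_fridays, method="ffill")
print(as_of)
#             0
# 2018-05-18  3
# 2018-06-15  1
# 2018-07-20  3
# 2018-08-17  3
```

Passing `method="bfill"` instead would pick the next available value (2 for 2018-08-17 in this data set).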

If, say, the whole index was shifted as in your example:

dates_df.resample(rule='M', loffset=WeekOfMonth(week=2, weekday=4)).asfreq().dropna()

# Output:

2018-06-15  1
2018-07-20  1
2018-08-17  2
2018-09-21  3

What's happening there is that you're resampling by rule 'M' (month end) and then offsetting (shifting forward) the index to the 3rd Friday that follows each month end.

Before the offset is applied, this is how it looks:

dates_df.resample(rule='M').asfreq().dropna()

# Output:

2018-05-31  1
2018-06-30  1
2018-07-31  2
2018-08-31  3
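As far as I know, the loffset argument was deprecated in pandas 1.1 and removed in 2.0, so the shifted variant above will not run on recent versions. A minimal sketch of the same two steps, assuming a MonthEnd offset object as the rule (to avoid depending on the 'M' string alias) and a manual shift of the index; `month_end` and `shifted` are names introduced here.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd, WeekOfMonth

dates = pd.date_range("2018-05-01", "2018-08-31")
dates_df = pd.DataFrame(data=[1, 2, 3] * (len(dates) // 3), index=dates)

# Step 1: month-end resample, labels are the month-end dates.
month_end = dates_df.resample(MonthEnd()).last()

# Step 2: shift each label forward to the next third Friday,
# which is what loffset=WeekOfMonth(...) used to do.
offset = WeekOfMonth(week=2, weekday=4)
shifted = month_end.copy()
shifted.index = pd.DatetimeIndex([ts + offset for ts in month_end.index])

print(shifted)
#             0
# 2018-06-15  1
# 2018-07-20  1
# 2018-08-17  2
# 2018-09-21  3
```

The values are still the month-end values; only the labels move, which is why this does not answer the original question.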
