Computing date differences in a Pandas GroupBy object

2022-01-11 00:00:00 python pandas time-series

Question

I have a Pandas DataFrame with the following format:

    In [0]: df
    Out[0]:
       col1  col2       date
    0     1     1 2015-01-01
    1     1     2 2015-01-09
    2     1     3 2015-01-10
    3     2     1 2015-02-10
    4     2     2 2015-02-10
    5     2     3 2015-02-25

    In [1]: df.dtypes
    Out[1]:
    col1             int64
    col2             int64
    date    datetime64[ns]
    dtype: object

For each value of col1, we want to find the col2 value corresponding to the greatest difference in date between consecutive rows, with each group sorted by date. Assume there are no groups of size 1.

Desired output

    In [2]: output
    Out[2]:
    col1  col2
    1     1     # This is because the difference between 2015-01-09 and 2015-01-01 is the greatest
    2     2     # This is because the difference between 2015-02-25 and 2015-02-10 is the greatest

The real df has many values of col1, so the calculation has to go through groupby. Is this possible by applying a function as shown below? Note that the dates are already in ascending order.

    gb = df.groupby('col1')
    gb.apply(right_maximum_date_difference)

Answer

Here's something that's almost your dataframe (I avoided copying the dates and used plain integers instead):

    import pandas as pd

    df = pd.DataFrame({
        'col1': [1, 1, 1, 2, 2, 2],
        'col2': [1, 2, 3, 1, 2, 3],
        'date': [1, 9, 10, 10, 10, 25]
    })

With that, define:

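    def max_diff_date(g):
        # DataFrame.sort and .ix no longer exist in current pandas;
        # sort_values and label-based .loc are used here instead.
        g = g.sort_values('date')
        # Gap from each row to the next one (NaN on the last row of the group).
        gap_to_next = g['date'].diff().shift(-1)
        # col2 of the row that starts the largest gap.
        return g.loc[gap_to_next.idxmax(), 'col2']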

and you get:

    >> df.groupby(df.col1).apply(max_diff_date)
    col1
    1    1
    2    2
    dtype: int64
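The same idea can probably be written without apply, which may matter when df has many groups. The following is a minimal sketch, not the method from the answer above: it assumes the real datetime64 dates from the question, reads the desired output as "the col2 of the earlier row in the pair with the largest gap", and introduces two illustrative names that do not appear in the original (the helper column gap_to_next and the variable result).

```python
import pandas as pd

# The question's frame, with real dates instead of the integer stand-ins.
df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 2],
    'col2': [1, 2, 3, 1, 2, 3],
    'date': pd.to_datetime([
        '2015-01-01', '2015-01-09', '2015-01-10',
        '2015-02-10', '2015-02-10', '2015-02-25',
    ]),
})

# Sort so that consecutive rows within a group are consecutive in time.
df = df.sort_values(['col1', 'date'])

# Gap from each row to the next row of the same group, in seconds.
# The last row of every group has no successor, so its gap is NaN.
# 'gap_to_next' is a hypothetical helper column added for illustration.
next_date = df.groupby('col1')['date'].shift(-1)
df['gap_to_next'] = (next_date - df['date']).dt.total_seconds()

# Index label of the largest gap in each group (NaN values are skipped).
rows = df.groupby('col1')['gap_to_next'].idxmax()

# col2 at those rows, indexed by col1.
result = df.loc[rows].set_index('col1')['col2']
print(result)
```

Converting the gaps to seconds keeps idxmax working on plain floats. If ties should resolve differently, or if groups of size 1 can occur after all, this sketch would need extra handling.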
