pandas 数据框:基于列和时间范围的重复

问题描述

我有一个(非常简单的)熊猫数据框,看起来像这样:

I have a (very simplyfied here) pandas dataframe which looks like this:

df

    datetime             user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
2  2012-11-21 17:00:08   u3     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

我现在想做的是获取所有时间戳在 3 秒内的重复消息.期望的输出是:

What I would like to do now is to get all the duplicate messages which have their timestamp within 3 seconds. The desired output would be:

   datetime              user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

没有第三行,因为它的文本与第一行和第二行相同,但它的时间戳不是3秒以内.

without the third row, as its text is the same as in row one and two, but its timestamp is not within the range of 3 seconds.

我尝试将列 datetime 和 msg 定义为 duplicate() 方法的参数,但它返回一个空数据帧,因为时间戳不相同:

I tried to define the columns datetime and msg as parameters for the duplicate() method, but it returns an empty dataframe because the timestamps are not identical:

mask = df.duplicated(subset=['datetime', 'msg'], keep=False)

print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []

有没有一种方法可以为我的日期时间"参数定义一个范围?为了说明,某事喜欢:

Is there a way where I can define a range for my "datetime" parameter? To illustrate, something like:

mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)

我们将一如既往地为您提供任何帮助.

Any help here would as always be very much appreciated.


解决方案

这段代码给出了预期的输出

This Piece of code gives the expected output

df[(df.groupby(["msg"], as_index=False)["datetime"].diff().fillna(0).dt.seconds <= 3).reset_index(drop=True)]

我已对数据框的msg"列进行分组,然后选择该数据框的日期时间"列并使用内置函数 差异.Diff 函数查找该列的值之间的差异.用零填充 NaT 值并仅选择那些值小于 3 秒的索引.

I have grouped on "msg" column of dataframe and then selected "datetime" column of that dataframe and used inbuilt function diff. Diff function finds the difference between values of that column. Filled the NaT values with zero and selected only those indexes which have values less than 3 seconds.

在使用上述代码之前,请确保您的数据框按日期时间升序排序.

Before using above code make sure that your dataframe is sorted on datetime in ascending order.

相关文章