pandas dataframe: duplicates based on a column and a time range



I have a (heavily simplified) pandas dataframe that looks like this:


   datetime              user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
2  2012-11-21 17:00:08   u3     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

What I would like to do now is get all duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:

   datetime              user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you


That is, without the third row: its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.

I tried passing the columns datetime and msg as the subset for the duplicated() method, but it returns an empty dataframe because the timestamps are not identical:

mask = df.duplicated(subset=['datetime', 'msg'], keep=False)

Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []


Is there a way to define a range for my "datetime" parameter? To illustrate, something like:

mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)


Any help here would as always be very much appreciated.



This piece of code gives the expected output:

# total_seconds() counts whole days too; the mask keeps df's own index, so no reset_index is needed
df[df.groupby("msg")["datetime"].diff().dt.total_seconds().fillna(0) <= 3]

I grouped the dataframe on the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between consecutive values within each group. I then filled the missing (NaT) values with zero and selected only the rows whose difference is at most 3 seconds.
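To make the intermediate step visible, here is a minimal sketch that rebuilds the sample frame from the question and computes the per-group gaps. It assumes the datetime column has been parsed with pd.to_datetime; the names gap and gap_seconds are only illustrative.

```python
import pandas as pd

# Sample data copied from the question; datetime is parsed to real timestamps.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

# Gap to the previous row with the same msg.
# The first row of each group has no predecessor, so its gap is NaT.
gap = df.groupby("msg")["datetime"].diff()

# Full length of each gap in seconds (NaT becomes NaN).
gap_seconds = gap.dt.total_seconds()
print(gap_seconds)
# rows 1 and 4 are 3.0 and 2.0 seconds; row 2 is 868737.0 (about 10 days)
```

Note that .dt.seconds returns only the seconds component of a timedelta and ignores whole days (for row 2 it gives 4737 rather than 868737), so .dt.total_seconds() is the safer value to compare when gaps can exceed a day.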


Before using the code above, make sure that your dataframe is sorted by datetime in ascending order.
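One caveat with filling NaT with zero is that the first row of every group always passes the filter, so a message that occurs only once would also be kept. A possible variant, sketched below, sorts the frame itself and keeps a row only if it is within the window of the previous or the next row with the same message. The helper near_duplicates and its parameters are hypothetical names, and the "bye" row is invented here purely to show the single-occurrence case.

```python
import pandas as pd

def near_duplicates(frame, key="msg", time_col="datetime", window="3s"):
    """Keep rows that lie within `window` of a neighbouring row with the same key."""
    ordered = frame.sort_values(time_col)
    grouped = ordered.groupby(key)[time_col]
    limit = pd.Timedelta(window)
    to_prev = grouped.diff()      # gap to the earlier row in the group (NaT for the first)
    to_next = -grouped.diff(-1)   # gap to the later row in the group (NaT for the last)
    # Comparisons against NaT are False, so rows without a close neighbour drop out.
    mask = (to_prev <= limit) | (to_next <= limit)
    return ordered[mask].sort_index()

# Rows from the question in shuffled order, plus one invented single-occurrence row.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-22 18:08:37",
        "2012-11-11 15:41:08",
        "2012-11-21 17:00:08",
        "2012-11-11 15:41:11",
        "2012-11-22 18:08:35",
        "2012-11-23 09:00:00",   # hypothetical row, not in the question
    ]),
    "user": ["u5", "u1", "u3", "u2", "u4", "u6"],
    "type": ["txt"] * 6,
    "msg": ["hello you", "hello world", "hello world",
            "hello world", "hello you", "bye"],
})

result = near_duplicates(df)
print(result)
```

This compares adjacent rows only, so a chain of messages each within 3 seconds of the next is kept as a whole even if the first and last are further apart; whether that matches the intended meaning of "within 3 seconds" is an assumption.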
