Finding duplicate rows in a pandas DataFrame

2022-01-10 00:00:00 python pandas dataframe duplicates

Problem description

I am trying to find duplicate rows in a pandas DataFrame.

    df = pd.DataFrame(data=[[1,2],[3,4],[1,2],[1,4],[1,2]], columns=['col1','col2'])

    df
    Out[15]:
       col1  col2
    0     1     2
    1     3     4
    2     1     2
    3     1     4
    4     1     2

    duplicate_bool = df.duplicated(subset=['col1','col2'], keep='first')
    duplicate = df.loc[duplicate_bool == True]

    duplicate
    Out[16]:
       col1  col2
    2     1     2
    4     1     2

Is there a way to add a column that refers to the index of the first duplicate (the one that is kept)?

    duplicate
    Out[16]:
       col1  col2  index_original
    2     1     2               0
    4     1     2               0

Note: df could be very large in my case.

Solution

Use groupby to create a new column holding the index of the first row in each group, and then call duplicated:

    df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')

    df[df.duplicated(subset=['col1','col2'], keep='first')]

       col1  col2  index_original
    2     1     2               0
    4     1     2               0

Details

I group by the first two columns and then call transform + idxmin to get the first index of each group. Because every value of col1 is identical within a group, idxmin returns the index label of the group's first row.

    df.groupby(['col1', 'col2']).col1.transform('idxmin')

    0    0
    1    1
    2    0
    3    3
    4    0
    Name: col1, dtype: int64
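The idxmin trick depends on all values of col1 being equal within a group. As a possible alternative (a sketch, not part of the original answer), the index itself can be grouped and its first label broadcast back to every row. The name first_idx is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])

# Turn the index into a Series, group it by the two key columns,
# and broadcast the first index label of each group to all its rows.
first_idx = (
    df.index.to_series()
      .groupby([df['col1'], df['col2']])
      .transform('first')
)

print(first_idx.tolist())
```

This variant does not look at the values of any grouped column, so it should behave the same way regardless of the column dtypes. Whether it is faster on a very large frame is not something this sketch measures.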

duplicated gives me a boolean mask of the rows I want to keep in the result, that is, every repeat after the first occurrence:

    df.duplicated(subset=['col1','col2'], keep='first')

    0    False
    1    False
    2     True
    3    False
    4     True
    dtype: bool

The rest is just boolean indexing.
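Putting the steps above together, a minimal self-contained version of the approach might look like this. The variable names mask and duplicate are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])

# Index label of the first row of each (col1, col2) group,
# broadcast to every row of that group.
df['index_original'] = df.groupby(['col1', 'col2'])['col1'].transform('idxmin')

# True for every row that repeats an earlier (col1, col2) pair.
mask = df.duplicated(subset=['col1', 'col2'], keep='first')

# Boolean indexing keeps only the repeated rows.
duplicate = df[mask]
print(duplicate)
```

Passing subset explicitly matters here: once index_original has been added, calling duplicated without subset would also compare that new column.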
