pandas: Drop duplicates in groupby 'date'

2022-01-10 00:00:00 python pandas pandas-groupby duplicates unique

Question

In the dataframe below, I would like to eliminate the duplicate cid values so that the output of df.groupby('date').cid.size() matches the output of df.groupby('date').cid.nunique().

I have looked at this post, but it does not seem to offer a solid solution to the problem.

    import pandas as pd

    df = pd.read_csv('https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df')

    df.groupby('date').cid.size()
    date
    2005       7
    2006     237
    2007    3610
    2008    1318
    2009    2664
    2010     997
    2011    6390
    2012    2904
    2013    7875
    2014    3979

    df.groupby('date').cid.nunique()
    date
    2005      3
    2006     10
    2007    227
    2008     52
    2009    142
    2010     57
    2011    219
    2012     99
    2013    238
    2014    146
    Name: cid, dtype: int64

Things I have tried:

df.groupby([df['date']]).drop_duplicates(cols='cid') gives this error: AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method

df.groupby(('date').drop_duplicates('cid')) gives this error: AttributeError: 'str' object has no attribute 'drop_duplicates'

Solution

You don't need groupby to drop duplicates based on a few columns; you can pass those columns to drop_duplicates as a subset instead:

    df2 = df.drop_duplicates(["date", "cid"])
    df2.groupby('date').cid.size()
    Out[99]:
    date
    2005      3
    2006     10
    2007    227
    2008     52
    2009    142
    2010     57
    2011    219
    2012     99
    2013    238
    2014    146
    dtype: int64
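
To see the idea without downloading the CSV, here is a minimal sketch on made-up data. The toy frame and its `amount` column are assumptions for illustration only and are not part of the original dataset; only `date` and `cid` come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the CSV used in the question.
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006, 2006],
    "cid": ["a", "a", "b", "c", "c", "c"],
    "amount": [1, 2, 3, 4, 5, 6],  # illustrative extra column
})

# Keep only the first row for each (date, cid) pair.
df2 = df.drop_duplicates(subset=["date", "cid"])

# Row count per date after de-duplication.
size_per_date = df2.groupby("date").cid.size()

# Distinct cid count per date in the original frame.
nunique_per_date = df.groupby("date").cid.nunique()

print(size_per_date.tolist())
print(nunique_per_date.tolist())
```

With the default keep="first", the other columns of each surviving row come from the first occurrence of that (date, cid) pair, so it is worth checking whether that is the row you want to keep.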
