按重复对 DataFrame 的行进行排序

2022-01-10 00:00:00 python pandas dataframe sorting duplicates

问题描述

如何对 DataFrame 进行排序，以便回收"重复列中的行?

How can I sort a DataFrame so that rows in the duplicate column are "recycled"?

例如，我原来的 DataFrame 是这样的:

For example, my original DataFrame looks like this:

In [3]: df Out[3]: A B 0 r1 0 1 r1 1 2 r2 2 3 r2 3 4 r3 4 5 r3 5

我希望它转向:

In [3]: df_sorted Out[3]: A B 0 r1 0 2 r2 2 4 r3 4 1 r1 1 3 r2 3 5 r3 5

对行进行排序，使得列 A 中的行处于回收"状态.时尚.

Rows are sorted such that rows in columns A are in a "recycled" fashion.

我在 Pandas 中搜索过 API，但似乎没有任何合适的方法可以这样做.我可以编写一个复杂的函数来完成此操作，但只是想知道是否有任何智能方法或现有的 pandas 方法可以做到这一点?提前非常感谢.

I have searched APIs in Pandas, but it seems there isn't any proper method to do so. I can write a complicated function to accomplish this, but just wondering is there any smart way or existing pandas method can do this? Thanks a lot in advance.

更新:为错误的陈述道歉.在我真正的问题中，列 B 包含字符串值.

Update: Apologies for a wrong statement. In my real problem, column B contains string values.

解决方案

你可以使用cumcount 用于计算列 A 中的重复项，然后是 sort_values 首先由 A (在示例没必要，在实际数据中可能很重要)，然后通过 C.最后删除列 C 由 <代码>丢弃:

You can use cumcount for counting duplicates in column A, then sort_values first by A (in sample not necessary, in real data maybe important) and then by C. Last remove column C by drop:

df['C'] = df.groupby('A')['A'].cumcount() df.sort_values(by=['C', 'A'], inplace=True) print (df) A B C 0 r1 0 0 2 r2 2 0 4 r3 4 0 1 r1 1 1 3 r2 3 1 5 r3 5 1 df.drop('C', axis=1, inplace=True) print (df) A B 0 r1 0 2 r2 2 4 r3 4 1 r1 1 3 r2 3 5 r3 5

时间安排:

小df (len(df)=6)

In [26]: %timeit (jez(df)) 1000 loops, best of 3: 2 ms per loop In [27]: %timeit (boud(df1)) 100 loops, best of 3: 2.52 ms per loop

大 df (len(df)=6000)

In [23]: %timeit (jez(df)) 100 loops, best of 3: 3.44 ms per loop In [28]: %timeit (boud(df1)) 100 loops, best of 3: 2.52 ms per loop

计时代码:

df = pd.concat([df]*1000).reset_index(drop=True) df1 = df.copy() def jez(df): df['C'] = df.groupby('A')['A'].cumcount() df.sort_values(by=['C', 'A'], inplace=True) df.drop('C', axis=1, inplace=True) return (df) def boud(df): df['C'] = df.groupby('A')['B'].rank() df = df.sort_values(['C', 'A']) df.drop('C', axis=1, inplace=True) return (df) 100 loops, best of 3: 4.29 ms per loop

相关文章