Applying coalesce after grouping on two pandas columns

2022-04-14 00:00:00 python numpy pandas dataframe nan

Problem description

Please help me out. I have a dataframe like this:
|    |    ID | Result   |   measurement_1 |   measurement_2 |   measurement_3 |   measurement_4 |   measurement_5 | start_time                 | end-time                   |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
|  0 | 12345 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-20 21:24:03.390000 | 2020-10-20 23:46:36.990000 |
|  1 | 12345 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-21 04:36:03.390000 | 2020-10-21 06:58:36.990000 |
|  2 | 12345 | nan      |           49584 |            2827 |             nan |             nan |             nan | 2020-10-21 09:24:03.390000 | 2020-10-21 11:46:36.990000 |
|  3 | 12345 | nan      |             nan |             nan |            3940 |             nan |             nan | 2020-10-21 14:12:03.390000 | 2020-10-21 16:34:36.990000 |
|  4 | 12345 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-21 21:24:03.390000 | 2020-10-21 23:46:36.990000 |
|  5 | 12345 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-22 02:40:51.390000 | 2020-10-22 05:03:24.990000 |
|  6 | 12345 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-22 08:26:27.390000 | 2020-10-22 10:49:00.990000 |
|  7 | 12345 | Pass     |             nan |             nan |             nan |             392 |             304 | 2020-10-22 14:12:03.390000 | 2020-10-22 16:34:36.990000 |
|  8 | 12346 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-22 19:57:39.390000 | 2020-10-22 22:20:12.990000 |
|  9 | 12346 | nan      |           22839 |            4059 |             nan |             nan |             nan | 2020-10-23 01:43:15.390000 | 2020-10-23 04:05:48.990000 |
| 10 | 12346 | nan      |             nan |             nan |            4059 |             nan |             nan | 2020-10-23 07:28:51.390000 | 2020-10-23 09:51:24.990000 |
| 11 | 12346 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-23 13:14:27.390000 | 2020-10-23 15:37:00.990000 |
| 12 | 12346 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-23 19:00:03.390000 | 2020-10-23 21:22:36.990000 |
| 13 | 12346 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-24 00:45:39.390000 | 2020-10-24 03:08:12.990000 |
| 14 | 12346 | Fail     |             nan |             nan |             nan |            2938 |             495 | 2020-10-24 06:31:15.390000 | 2020-10-24 08:53:48.990000 |
| 15 | 12345 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-24 12:16:51.390000 | 2020-10-24 14:39:24.990000 |
| 16 | 12345 | nan      |           62839 |            1827 |             nan |             nan |             nan | 2020-10-24 18:02:27.390000 | 2020-10-24 20:25:00.990000 |
| 17 | 12345 | nan      |             nan |             nan |            2726 |             nan |             nan | 2020-10-24 23:48:03.390000 | 2020-10-25 02:10:36.990000 |
| 18 | 12345 | nan      |             nan |             nan |             nan |             nan |             nan | 2020-10-25 05:33:39.390000 | 2020-10-25 07:56:12.990000 |
| 19 | 12345 | Fail     |             nan |             nan |             nan |             nan |            1827 | 2020-10-25 11:19:15.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+

and I want my output to look like:

+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
|    |    ID | Result   |   measurement_1 |   measurement_2 |   measurement_3 | measurement_4   |   measurement_5 | start_time                 | end-time                   |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
|  0 | 12345 | Pass     |           49584 |            2827 |            3940 | 392             |             304 | 2020-10-20 21:24:03.390000 | 2020-10-22 16:34:36.990000 |
|  1 | 12346 | Fail     |           22839 |            4059 |            4059 | 2938            |             495 | 2020-10-22 19:57:39.390000 | 2020-10-24 08:53:48.990000 |
|  2 | 12345 | Fail     |           62839 |            1827 |            2726 | nan             |            1827 | 2020-10-24 12:16:51.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
So far I am able to group the columns on `ID` and `Result`. Now I want to apply a coalesce to the result (newDf):


    df = pd.read_excel("Test_Coalesce.xlsx")
    newDf  = df.groupby(['ID','Result'])
    newDf.all().reset_index()


Solution

It looks like you want to group by consecutive blocks of `ID`. If so:

    # a new block starts every time ID differs from the previous row
    blocks = df['ID'].ne(df['ID'].shift()).cumsum()

    # GroupBy.first/last skip NaN, so 'first' coalesces each measurement column
    agg_dict = {k: 'first' if k != 'end-time' else 'last'
                for k in df.columns}
    df.groupby(blocks).agg(agg_dict)
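
A minimal, self-contained sketch of the same idea on made-up data (the values and the reduced column set below are invented for illustration, not the asker's real file):

    import numpy as np
    import pandas as pd

    # toy data: two consecutive ID blocks, measurements scattered across rows
    df = pd.DataFrame({
        'ID':            [12345, 12345, 12345, 12346, 12346],
        'Result':        [np.nan, np.nan, 'Pass', np.nan, 'Fail'],
        'measurement_1': [np.nan, 49584, np.nan, 22839, np.nan],
        'end-time':      pd.to_datetime(['2020-10-20', '2020-10-21',
                                         '2020-10-22', '2020-10-23',
                                         '2020-10-24']),
    })

    # consecutive-block key: increments whenever ID changes from the previous row
    blocks = df['ID'].ne(df['ID'].shift()).cumsum()

    # 'first'/'last' on a GroupBy skip NaN, so 'first' acts as a coalesce
    agg_dict = {k: 'first' if k != 'end-time' else 'last' for k in df.columns}
    out = df.groupby(blocks).agg(agg_dict).reset_index(drop=True)
    print(out)
    #       ID Result  measurement_1   end-time
    # 0  12345   Pass        49584.0 2020-10-22
    # 1  12346   Fail        22839.0 2020-10-24

Because `GroupBy.first` returns the first non-NA value per column, it behaves like SQL's COALESCE here, and grouping on consecutive blocks (rather than on `ID` alone) keeps the two separate runs of 12345 as separate output rows.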
