Pandas pivot table with custom conditions on a DataFrame

2022-03-03 00:00:00 python pandas pivot-table

Problem description

I want to create a pivot table based on a custom condition in a DataFrame.

The DataFrame looks like this:

>>> import pandas as pd
>>> df = pd.DataFrame({"Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
...                    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
...                    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
...                    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
...                    "Pincode": ["X1", "Y1", "Z1", "P1", "Q1", "R1", "S1", "X2"]})
>>> df
  Area City Condition  Population Pincode
0    A    X      Good         100      X1
1    A    Y       Bad         150      Y1
2    B    Z      Good          50      Z1
3    A    P      Good         200      P1
4    C    Q      Good         170      Q1
5    A    R       Bad         390      R1
6    D    S      Good          80      S1
7    A    X      Good         100      X2

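Now I want to pivot df so that, for each area, I can see the number of unique cities, the number of unique cities whose Condition is "Good", and the total population of the area.

The output I expect is: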

Area  city_count  good_city_count   Population
A        4        2                 940
B        1        1                 50
C        1        1                 170
D        1        1                 80
All      7        5                 1240

I can pass a dictionary to the aggfunc parameter, but that does not give me the count of "Good" cities among the cities.

>>> city_count = df.pivot_table(index=["Area"],
...                             values=["City", "Population"],
...                             aggfunc={"City": lambda x: len(x.unique()),
...                                      "Population": "sum"},
...                             margins=True)

    Area    City    Population
0   A       4       940
1   B       1       50
2   C       1       170
3   D       1       80
4   All     7       1240

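I could merge two separate pivot tables, one with the city counts and one with the population, but that does not scale to large datasets with a large aggfunc dictionary.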
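For reference, the merge workaround mentioned above might look roughly like the sketch below. It is only an illustration, not code from the question: the names totals, good, and merged are hypothetical, and it assumes the second pivot is built from the rows whose Condition is "Good". Every additional conditional count would need another filtered pivot and another join, which is where the approach becomes unwieldy.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
})

# Pivot 1: unique cities and total population per area.
totals = df.pivot_table(index="Area",
                        values=["City", "Population"],
                        aggfunc={"City": "nunique", "Population": "sum"})

# Pivot 2: unique cities per area, restricted to rows marked "Good".
good = (df[df["Condition"] == "Good"]
        .pivot_table(index="Area", values=["City"], aggfunc="nunique")
        .rename(columns={"City": "good_city_count"}))

# Join on the Area index; an area with no "Good" city would get 0.
merged = (totals.rename(columns={"City": "city_count"})
          .join(good)
          .fillna({"good_city_count": 0})
          .astype({"good_city_count": int}))

print(merged)
```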

Solution

Here is another approach that does not use pivot_table. Combine np.where with groupby + agg:

import numpy as np

# Keep the city name only on "Good" rows; NaN is ignored by nunique
df['Condition'] = np.where(df['Condition']=='Good', df['City'], np.nan)
df = (df.groupby('Area').agg({'City':'nunique', 'Condition':'nunique', 'Population':'sum'})
        .rename(columns={'City':'city_count', 'Condition':'good_city_count'}))
# Add a totals row by summing the per-area values
df.loc['All',:] = df.sum()
df = df.astype(int).reset_index()

print(df)
  Area  city_count  good_city_count  Population
0    A           4                2         940
1    B           1                1          50
2    C           1                1         170
3    D           1                1          80
4  All           7                5        1240
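
The same idea can be written without overwriting the Condition column, which may be preferable if df is needed again afterwards. The following is a minimal sketch, assuming a reasonably recent pandas version that supports named aggregation; the helper column good_city and the variable summary are hypothetical names introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
})

# Keep the city name only on "Good" rows; other rows become missing values,
# which nunique ignores by default.
summary = (
    df.assign(good_city=df["City"].where(df["Condition"] == "Good"))
      .groupby("Area")
      .agg(city_count=("City", "nunique"),
           good_city_count=("good_city", "nunique"),
           Population=("Population", "sum"))
)

# Append a totals row that sums the per-area values, as in the answer above.
summary.loc["All"] = summary.sum()
summary = summary.astype(int).reset_index()

print(summary)
```

Because each output column is declared as a (source column, function) pair, adding another conditional count only takes one more helper column and one more keyword argument.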
