使用基于另一列的GROUPBY的最小最大归一化来归一化一列数据帧
问题描述
数据帧如图所示
Name Job Salary
john painter 40000
peter engineer 50000
sam plumber 30000
john doctor 500000
john driver 20000
sam carpenter 10000
peter scientist 100000
如何按列名分组并对每个组上的薪资列应用规范化?
预期结果:
Name Job Salary
john painter 0.041666
peter engineer 0
sam plumber 1
john doctor 1
john driver 0
sam carpenter 0
peter scientist 1
我已尝试以下操作
data = df.groupby('Name').transform(lambda x: (x - x.min()) / x.max()- x.min())
但是,这会产生
Salary
0 -19999.960000
1 -50000.000000
2 -9999.333333
3 -19999.040000
4 -20000.000000
5 -10000.000000
6 -49999.500000
解决方案
您马上就到了。
>>> df
Name Job Salary
0 john painter 40000
1 peter engineer 50000
2 sam plumber 30000
3 john doctor 500000
4 john driver 20000
5 sam carpenter 10000
6 peter scientist 100000
>>>
>>> result = df.assign(Salary=df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min())))
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
所以基本上,您只是忘了用括号将x.max() - x.min()
括起来。
请注意,使用一系列矢量化操作可以更快地完成此操作。
>>> grouper = df.groupby('Name')['Salary']
>>> maxes = grouper.transform('max')
>>> mins = grouper.transform('min')
>>>
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
计时:
>>> # Setup
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df.Name = np.arange(len(df)//4).repeat(4) # 4 names per group
>>> df
Name Job Salary
0 0 painter 40000
1 0 engineer 50000
2 0 plumber 30000
3 0 doctor 500000
4 1 driver 20000
... ... ... ...
6995 1748 plumber 30000
6996 1749 doctor 500000
6997 1749 driver 20000
6998 1749 carpenter 10000
6999 1749 scientist 100000
[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
...:
...:
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
相关文章