Identify the closest value in a column for each filter using Pandas

2022-01-19 00:00:00 python pandas group-by boolean closest

Problem description

I have a data frame with categories and values. Within each category, I need to find the value closest to a given target. I think I'm close, but I can't get the right output when applying the results of argsort back to the original dataframe.

For example, with the input defined in the code below and a target of 2.0, only (a, 1, True), (b, 2, True), and (c, 2, True) should be marked, and all other isClosest values should be False.

If multiple values are equally close, only the first one listed should be marked.

Here is the code I have. The sorting step works, but I can't get its result to reapply to the dataframe correctly. I would love some pointers.

import pandas as pd

df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category'] == c]
    # the column is named 'values' (not 'value'); argsort returns positions within the filtered subset
    sortargs = (filteredCategories['values'] - 2.0).abs().argsort()
    # how to use sortargs so that we set column in df isClosest=True if it's the closest value in each category to 2.0?


Solution

You can create a column of absolute differences from the target (2 here):

df['dif'] = (df['values'] - 2).abs()

df
Out: 
  category  values  dif
0        a       1    1
1        b       2    0
2        b       3    1
3        b       4    2
4        c       5    3
5        a       4    2
6        b       3    1
7        c       2    0
8        c       1    1
9        a       0    2

Then use groupby.transform to check whether each row's difference equals the minimum difference in its group:

df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']

df
Out: 
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False

df.groupby('category')['dif'].idxmin() also gives you the index labels of the closest value in each category. You can use those for mapping too.

For selection:

df.loc[df.groupby('category')['dif'].idxmin()]
Out: 
  category  values  dif
0        a       1    1
1        b       2    0
7        c       2    0

For assignment:

df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out: 
  category  values  dif is_closest
0        a       1    1       True
1        b       2    0       True
2        b       3    1      False
3        b       4    2      False
4        c       5    3      False
5        a       4    2      False
6        b       3    1      False
7        c       2    0       True
8        c       1    1      False
9        a       0    2      False
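The idxmin assignment can also be done without keeping a 'dif' column in the frame. The following is a minimal sketch, not part of the original answer: mark_closest is a hypothetical helper name, its parameters are assumptions for illustration, and it assumes df has a unique index (idxmin returns labels, so duplicate labels would flag more than one row).

```python
import pandas as pd

def mark_closest(df, target, group_col='category', value_col='values'):
    """Hypothetical helper: True for the first closest row in each group."""
    # Absolute distance of every value from the target.
    dif = (df[value_col] - target).abs()
    # One index label per group: the first row with the smallest distance.
    first_closest = dif.groupby(df[group_col]).idxmin()
    # Start with all False, then flag only the selected labels.
    flags = pd.Series(False, index=df.index)
    flags.loc[first_closest] = True
    return flags

df = pd.DataFrame({
    'category': ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a'],
    'values': [1, 2, 3, 4, 5, 4, 3, 2, 1, 0],
})
df['isClosest'] = mark_closest(df, 2.0)
print(df[df['isClosest']])  # rows 0, 1 and 7
```

This fills the isClosest column from the question directly, and the target (2.0) becomes an argument rather than a constant.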

The difference between these approaches shows up when there are ties. If you compare each difference against the group minimum, every tied row gets True. With idxmin, only the first occurrence in each group gets True, so there is exactly one per group.
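The sample data above has no ties, so both approaches produce the same result there. To make the difference visible, here is a small sketch with made-up data (not from the original question) in which category 'a' contains two values that are equally close to 2:

```python
import pandas as pd

# Hypothetical data: in category 'a', both 1 and 3 are a distance of 1 from 2.
df = pd.DataFrame({'category': ['a', 'a', 'a', 'b'],
                   'values': [1, 3, 5, 2]})
dif = (df['values'] - 2).abs()

# Equality against the group minimum flags every tied row.
by_equality = dif.groupby(df['category']).transform('min') == dif

# idxmin returns one label per group: the first occurrence of the minimum.
by_idxmin = pd.Series(False, index=df.index)
by_idxmin.loc[dif.groupby(df['category']).idxmin()] = True

print(by_equality.tolist())  # [True, True, False, True]
print(by_idxmin.tolist())    # [True, False, False, True]
```

Since the question asks for only the first value to be marked when several are equally close, the idxmin version is the one that matches that requirement.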
