How to get the row and column names of a value from another DataFrame

2022-02-26 00:00:00 python pandas dataframe pandas-groupby

Problem description

Given a table of ranges (start, end):

name     blue         green          yellow        purple              
a        1, 5                        654, 678       11, 15
b                     88761, 88776  
c        1211, 1215                  38, 47    
d        89, 95                                     1567, 1578

and a DataFrame df:

Supplier        colour                   
Abi             1                               
John            678          
Smith           120               
Tim             1570 
Don             87560                       

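How can I filter df so that it keeps only the rows whose colour value falls within one of the ranges given in the table?

I also want to create a third column that links each value in colour to the row-column name of the matching cell in the range table. I would like the final df to look like this: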

Supplier        colour    Source                  
Abi             1         a-blue                      
John            678       a-yellow                   
Tim             1570      d-purple

Thanks!

s = df2.stack()
print(s)

gives:

a  1,      name          5
           blue       654,
           green       678
           yellow      11,
           purple       15
b  88761,  name      88776
c  1211,   name       1215
           blue        38,
           green        47
d  89,     name         95
           blue      1567,
           green      1578
dtype: object

and:

s = df2.stack()
s = s.str.split("[, ]+").apply(lambda x: pd.Interval(*map(int, x), closed="both"))
print(s)

gives:

Traceback (most recent call last):
  File "/Users/PycharmProjects/sup.py", line 12, in <module>
    s = s.str.split("[, ]+").apply(lambda x: pd.Interval(*map(int,x), closed="both"))
  File "/Users/.conda/envs/lib/python3.8/site-packages/pandas/core/series.py", line 4200, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2402, in pandas._libs.lib.map_infer
  File "/Users/PycharmProjects/sup.py", line 12, in <lambda>
    s = s.str.split("[, ]+").apply(lambda x: pd.Interval(*map(int,x), closed="both"))
TypeError: 'float' object is not iterable

Solution

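First, replace the blank cells with NaN using the replace() method. Here df1 is the range table and df2 is the supplier DataFrame (the question calls them df2 and df respectively):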

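df1=df1.replace(r'^\s*$',float('NaN'),regex=True)
#the anchored pattern matches only cells that are empty or all whitespace,
#so a cell such as '1, 5' that merely contains a space is left untouched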
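
A minimal sketch of this cleanup step, assuming the blank cells were read in as empty or whitespace-only strings. The small frame below is made up for illustration and is not the asker's actual data. With a non-string replacement value, pandas replaces any cell in which the pattern is found, which is why the pattern is anchored with ^ and $.

```python
import pandas as pd

# Hypothetical miniature of the range table; blanks are "" or " ".
df1 = pd.DataFrame({
    "name": ["a", "b"],
    "blue": ["1, 5", " "],
    "green": ["", "88761, 88776"],
})

# Only cells that are empty or entirely whitespace become NaN.
cleaned = df1.replace(r"^\s*$", float("nan"), regex=True)

print(cleaned)
```

The range strings keep their original text, so they can still be split on the comma afterwards.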

The idea is then to expand each range string into the actual list of values it covers:

s=df1.set_index('name').stack().replace('',float('NaN')).dropna()
#OR, if the line above does not work, try:
#s=df1.set_index('name').stack().replace(r'^\s*$',float('NaN'),regex=True).dropna()
m=s.str.split(',').map(lambda x:range(int(x[0]),int(x[1])+1)).explode()
#m now holds every integer covered by each range, one per row
s=m.unique()
#s holds all covered values and is used for filtering below
m.index=m.index.get_level_values(0)+'-'+m.index.get_level_values(1)
#the index now carries labels such as 'a-blue'
m=m.drop_duplicates()
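
To make the expansion step concrete, here is a small self-contained sketch on two made-up range cells. The sample values and the name lookup are assumptions for illustration only. It wraps the range in list() to keep the exploded values explicit.

```python
import pandas as pd

# Hypothetical sample shaped like the stacked range table:
# the index is (row name, column name), the value is "start, end".
s = pd.Series(
    ["1, 3", "10, 11"],
    index=pd.MultiIndex.from_tuples([("a", "blue"), ("a", "yellow")]),
)

# "1, 3" -> [1, 2, 3]; explode() gives each integer its own row
# while repeating the (name, colour) index entry.
m = s.str.split(",").map(
    lambda x: list(range(int(x[0]), int(x[1]) + 1))
).explode()

# Collapse the two index levels into one label such as "a-blue".
m.index = m.index.get_level_values(0) + "-" + m.index.get_level_values(1)

# Value -> label dictionary, the same mapping used later for the Source column.
lookup = {int(value): label for label, value in m.items()}
print(lookup)
```

Note that this approach materialises every integer in every range, so very wide ranges cost memory in proportion to their width.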

Finally, filter the supplier DataFrame:

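out=df2[df2['colour'].isin(s)].copy()
#keeps only the rows whose colour lies inside some range;
#.copy() avoids a SettingWithCopyWarning when a column is added later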

Now create the Source column:

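out['Source']=m[m.isin(out['colour'].tolist())].index
#the line above assigns by position, so it relies on m and out being in the same order
#OR map each colour value to its label, which does not depend on order:
out['Source']=out['colour'].map(dict(m.reset_index()[[0,'index']].values))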

Output of out:

   Supplier     colour  Source
0   Abi         1       a-blue
1   John        678     a-yellow
3   Tim         1570    d-purple
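
As an alternative to expanding every range, the interval idea from the question can be made to work. This is only a sketch, not the method used in the answer above. It assumes the ranges do not overlap (IntervalIndex.get_indexer requires that) and that blank cells are empty strings. The names ranges, suppliers, cells, bounds and labels are chosen here for illustration.

```python
import pandas as pd

# Range table and supplier frame from the question (blank cells written as "").
ranges = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "blue": ["1, 5", "", "1211, 1215", "89, 95"],
    "green": ["", "88761, 88776", "", ""],
    "yellow": ["654, 678", "", "38, 47", ""],
    "purple": ["11, 15", "", "", "1567, 1578"],
})
suppliers = pd.DataFrame({
    "Supplier": ["Abi", "John", "Smith", "Tim", "Don"],
    "colour": [1, 678, 120, 1570, 87560],
})

# Long format: one row per (name, column) cell.
cells = ranges.melt(id_vars="name", var_name="column", value_name="span")

# Pull out "start, end"; blank cells do not match and are dropped.
bounds = cells["span"].str.extract(r"(\d+)\s*,\s*(\d+)").dropna().astype("int64")
cells = cells.loc[bounds.index]

intervals = pd.IntervalIndex.from_arrays(
    bounds[0].to_numpy(), bounds[1].to_numpy(), closed="both"
)
labels = (cells["name"] + "-" + cells["column"]).to_numpy()

# get_indexer returns the position of the containing interval, or -1 for no match.
pos = intervals.get_indexer(suppliers["colour"].to_numpy())
hit = pos >= 0

out = suppliers[hit].copy()
out["Source"] = labels[pos[hit]]
print(out)
```

Because only the endpoints are stored, memory use does not grow with the width of the ranges.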
