Groupby + new column + grab the previous row's value based on conditions
Problem description
I have this DataFrame:
import pandas as pd
df = pd.DataFrame({'user':[1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4],
'date':['1995-09-01','1995-09-02','1995-10-03','1995-10-04','1995-10-05','1995-11-07','1995-11-08','1995-11-09','1995-11-10','1995-11-15','1995-12-18','1995-12-19','1995-12-20','1995-12-23','1995-12-26','1995-12-27'],
'dc':['1995-09-02','1995-09-02','1995-10-02','1995-10-05','1995-10-05','1995-11-05','1995-11-05','1995-11-10','1995-11-10','1995-11-10','1995-12-10','1995-12-23','1995-12-23','1995-12-23','1995-12-23','1995-12-23'],
'tp':['s','c','f','s','c','c','f','s','c','s','f','s','s','c','s','f'],
'vt':['0','1','0','0','1','0','0','0','1','0','0','0','0','1','0','0'],
'c1':['1','5','0','2','3','9','3','2','0','5','5','6','4','0','6','0'],
'c2':['3','4','0','2','5','3','8','4','0','6','2','7','0','0','8','0'],
'c3':['5','5','2','5','6','4','2','4','4','6','3','4','3','8','2','7']})
df
This gives:
user date dc tp vt c1 c2 c3
1 1995-09-01 1995-09-02 s 0 1 3 5
1 1995-09-02 1995-09-02 c 1 5 4 5
1 1995-10-03 1995-10-02 f 0 0 0 2
2 1995-10-04 1995-10-05 s 0 2 2 5
2 1995-10-05 1995-10-05 c 1 3 5 6
2 1995-11-07 1995-11-05 c 0 9 3 4
2 1995-11-08 1995-11-05 f 0 3 8 2
3 1995-11-09 1995-11-10 s 0 2 4 4
3 1995-11-10 1995-11-10 c 1 0 0 4
3 1995-11-15 1995-11-10 s 0 5 6 6
3 1995-12-18 1995-12-10 f 0 5 2 3
4 1995-12-19 1995-12-23 s 0 6 7 4
4 1995-12-20 1995-12-23 s 0 4 0 3
4 1995-12-23 1995-12-23 c 1 0 0 8
4 1995-12-26 1995-12-23 s 0 6 8 2
4 1995-12-27 1995-12-23 f 0 0 0 7
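I want to create a new column df['dc2'] where, grouping by user, df['dc2'] = df['dc'].
However, if the user has a row that meets the conditions 'tp' = 'c' & 'vt' = 1 & 'c1' = 0 & 'c2' = 0,
then take the date from the previous entry instead (the previous row for that user).
# i.e. for user 3, the entry with 'tp' = 'c' & 'vt' = 1 has 'c1' = 0 and 'c2' = 0, so the value of df['dc2'] for user 3 will be '1995-11-09' rather than '1995-11-10'.
# i.e. for user 4, the entry with 'tp' = 'c' & 'vt' = 1 has 'c1' = 0 and 'c2' = 0, so df['dc2'] for user 4 should be '1995-12-20' rather than '1995-12-23'.
Here is the desired result: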
user date dc dc2 tp vt c1 c2 c3
1 1995-09-01 1995-09-02 1995-09-02 s 0 1 3 5
1 1995-09-02 1995-09-02 1995-09-02 c 1 5 4 5
1 1995-10-03 1995-10-02 1995-10-02 f 0 0 0 2
2 1995-10-04 1995-10-05 1995-10-05 s 0 2 2 5
2 1995-10-05 1995-10-05 1995-10-05 c 1 3 5 6
2 1995-11-07 1995-11-05 1995-11-05 c 0 9 3 4
2 1995-11-08 1995-11-05 1995-11-05 f 0 3 8 2
3 1995-11-09 1995-11-10 1995-11-09 s 0 2 4 4
3 1995-11-10 1995-11-10 1995-11-09 c 1 0 0 4
3 1995-11-15 1995-11-10 1995-11-09 s 0 5 6 6
3 1995-12-18 1995-12-10 1995-11-09 f 0 5 2 3
4 1995-12-19 1995-12-23 1995-12-20 s 0 6 7 4
4 1995-12-20 1995-12-23 1995-12-20 s 0 4 0 3
4 1995-12-23 1995-12-23 1995-12-20 c 1 0 0 8
4 1995-12-26 1995-12-23 1995-12-20 s 0 6 8 2
4 1995-12-27 1995-12-23 1995-12-20 f 0 0 0 7
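Solution
We create a boolean mask representing the condition tp = c & vt = 1 & c1 = 0 & c2 = 0, then group by the user column and apply a custom transformation function f, which selects the previous row's value based on that condition: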
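# Boolean mask for rows where tp == 'c', vt == '1', c1 == '0' and c2 == '0' (these columns hold strings)
m = (df['tp'].eq('c') & df['vt'].eq('1')
     & df['c1'].eq('0') & df['c2'].eq('0'))
# Within each user, keep only the date of the row just before a matching row, then spread it over the group
f = lambda s: s.mask(~m.shift(-1, fill_value=False)).ffill().bfill()
# Users with no matching row stay NaN here and fall back to dc
df['dc2'] = df.groupby('user')['date'].transform(f).fillna(df['dc'])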
user date dc tp vt c1 c2 c3 dc2
0 1 1995-09-01 1995-09-02 s 0 1 3 5 1995-09-02
1 1 1995-09-02 1995-09-02 c 1 5 4 5 1995-09-02
2 1 1995-10-03 1995-10-02 f 0 0 0 2 1995-10-02
3 2 1995-10-04 1995-10-05 s 0 2 2 5 1995-10-05
4 2 1995-10-05 1995-10-05 c 1 3 5 6 1995-10-05
5 2 1995-11-07 1995-11-05 c 0 9 3 4 1995-11-05
6 2 1995-11-08 1995-11-05 f 0 3 8 2 1995-11-05
7 3 1995-11-09 1995-11-10 s 0 2 4 4 1995-11-09
8 3 1995-11-10 1995-11-10 c 1 0 0 4 1995-11-09
9 3 1995-11-15 1995-11-10 s 0 5 6 6 1995-11-09
10 3 1995-12-18 1995-12-10 f 0 5 2 3 1995-11-09
11 4 1995-12-19 1995-12-23 s 0 6 7 4 1995-12-20
12 4 1995-12-20 1995-12-23 s 0 4 0 3 1995-12-20
13 4 1995-12-23 1995-12-23 c 1 0 0 8 1995-12-20
14 4 1995-12-26 1995-12-23 s 0 6 8 2 1995-12-20
15 4 1995-12-27 1995-12-23 f 0 0 0 7 1995-12-20
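For readers who want to see the one-liner unpacked, here is a minimal step-by-step sketch of the same idea on a reduced sample (users 1 and 3 only). It is a variant rather than the answer's exact code: it shifts the mask per user instead of over the whole frame, on the assumption that a matching row at the very start of one user's block should not pick up the last row of the previous user. The names prev_row and picked are illustrative only.

```python
import pandas as pd

# Reduced sample with the same columns as the question (values are strings).
df = pd.DataFrame({
    'user': [1, 1, 1, 3, 3, 3, 3],
    'date': ['1995-09-01', '1995-09-02', '1995-10-03',
             '1995-11-09', '1995-11-10', '1995-11-15', '1995-12-18'],
    'dc':   ['1995-09-02', '1995-09-02', '1995-10-02',
             '1995-11-10', '1995-11-10', '1995-11-10', '1995-12-10'],
    'tp':   ['s', 'c', 'f', 's', 'c', 's', 'f'],
    'vt':   ['0', '1', '0', '0', '1', '0', '0'],
    'c1':   ['1', '5', '0', '2', '0', '5', '5'],
    'c2':   ['3', '4', '0', '4', '0', '6', '2'],
})

# Step 1: rows that satisfy the condition (string comparisons).
m = (df['tp'].eq('c') & df['vt'].eq('1')
     & df['c1'].eq('0') & df['c2'].eq('0'))

# Step 2: flag the row just before each matching row, staying inside each user.
prev_row = m.groupby(df['user']).shift(-1, fill_value=False).astype(bool)

# Step 3: keep the date only on flagged rows, then spread it across the user.
picked = df['date'].where(prev_row)
picked = picked.groupby(df['user']).ffill()
picked = picked.groupby(df['user']).bfill()

# Step 4: users without a matching row fall back to their own dc values.
df['dc2'] = picked.fillna(df['dc'])
print(df[['user', 'date', 'dc', 'dc2']])
```

The comparisons use '1' and '0' as strings because the sample stores vt, c1 and c2 as strings; if those columns were numeric, the mask would compare against the integers 1 and 0 instead.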