How to merge one pandas DataFrame into another based on multiple conditions

2022-05-26 00:00:00 python numpy pandas dataframe numpy-ndarray

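Problem description

I have two DataFrames:

df1 and df2, where df1 serves as a reference (lookup) table for df2. Each row of df2 has to be matched against the rows of df1, the matching value from df1 merged into df2, and the updated df2 returned.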

df1:

    RB  BeginDate   EndDate    Valindex0
0   00  19000100    19811231    45
1   00  19820100    19841299    47
2   00  19850100    20010699    50
3   00  20010700    99999999    39

df2:

    RB  IssueDate   gs
0   L3  19990201    8
1   00  19820101    G
2   48  19820101    G
3   50  19820101    G
4   50  19820101    G
5   00  19860101    G
6   52  19820101    G
7   53  19820101    G
8   00  19500201    G
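
To experiment with the examples, the two frames can be rebuilt roughly as follows. This is only a sketch: the question does not state the dtypes, so it assumes RB is stored as a string (to keep the leading zero in "00"), that the date columns are plain integers, and that gs is a string column.

```python
import pandas as pd

# Assumption: RB is a string column and the dates are integers;
# the original post does not show the dtypes.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})

df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50", "00", "52", "53", "00"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101,
                  19860101, 19820101, 19820101, 19500201],
    "gs": ["8", "G", "G", "G", "G", "G", "G", "G", "G"],
})
```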

How can I merge these two DataFrames based on the following condition:

if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB']==df2['RB']:
    merge the value of df1['Valindex0'] to df2

Note that the final output is df2 with df1 merged into it, because df1 acts as a lookup table for df2.

The output should look like this:

df2:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   None
1   00  19820101    G   47    # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2   48  19820101    G   None
3   50  19820101    G   None
4   50  19820101    G   None
5   00  19860101    G   50
6   52  19820101    G   None
7   53  19820101    G   None
8   00  19500201    G   45

I know one way to do this, but it is very slow, especially when df1 is long:

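import numpy as np

conditions = []

# Build one boolean mask over df2 for every row of df1
for index, row in df1.iterrows():
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))

# Pick the Valindex0 of the first df1 row whose mask is True, otherwise None
df2['Valindex0'] = np.select(conditions, df1['Valindex0'].tolist(), default=None)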

Is there a faster solution?

Solution

Use IntervalIndex:

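import pandas as pd

# Closed intervals [BeginDate, EndDate]; get_indexer needs them to be non-overlapping
idx = pd.IntervalIndex.from_arrays(df1['BeginDate'],df1['EndDate'],closed='both')
for x in df1['RB'].unique():
    mask = df2['RB']==x
    # get_indexer gives the position of the interval containing each IssueDate (-1 if none)
    df2.loc[mask, 'Valindex0'] = df1.loc[idx.get_indexer(df2.loc[mask, 'IssueDate']), 'Valindex0'].values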

Output

   RB  IssueDate gs  Valindex0
0  L3   19990201  8        NaN
1  00   19820101  G       47.0
2  48   19820101  G        NaN
3  50   19820101  G        NaN
4  50   19820101  G        NaN
5  00   19860101  G       50.0
6  52   19820101  G        NaN
7  53   19820101  G        NaN
8  00   19500201  G       45.0
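
The IntervalIndex loop builds a single index from all rows of df1, so it relies on df1 containing only one RB value and on every matching IssueDate falling inside some interval. As a hedged alternative, not part of the original answer, the same lookup can be sketched as a plain merge on RB followed by a range filter. The dtypes below are assumptions (RB as string, dates as integers), and the helper column name row is made up for illustration.

```python
import pandas as pd

# Sample data from the question; dtypes are assumed, not stated in the post.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50", "00", "52", "53", "00"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101,
                  19860101, 19820101, 19820101, 19500201],
    "gs": ["8", "G", "G", "G", "G", "G", "G", "G", "G"],
})

# Remember each df2 row label in a helper column ("row" is a hypothetical name).
left = df2.reset_index().rename(columns={"index": "row"})

# Pair every df2 row with all df1 rows that share its RB.
pairs = left.merge(df1, on="RB", how="left")

# Keep only the pairs whose IssueDate lies in the closed range.
# Rows without an RB match have NaN bounds and are dropped here.
hit = pairs[pairs["IssueDate"].between(pairs["BeginDate"], pairs["EndDate"])]

# Assigning a Series aligns on the index, so unmatched rows become NaN.
# This assumes at most one matching range per df2 row.
result = df2.copy()
result["Valindex0"] = hit.set_index("row")["Valindex0"]
print(result)
```

The intermediate pairs frame holds one row per (df2 row, df1 row with the same RB) combination, so this trades memory for simplicity and may not suit a very large df1 with few distinct RB values.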
