Compare two dataframes of different length row by row and add columns for each row with equal values
Problem description
I have two pandas dataframes of different length, like this:
df1:
   Column1 Column2 Column3
0        1       a       r
1        2       b       u
2        3       c       k
3        4       d       j
4        5       e       f

df2:
   ColumnA ColumnB
0        1       a
1        1       d
2        1       e
3        2       r
4        2       w
5        3       y
6        3       h
What I am trying to do now is compare Column1 of df1 with ColumnA of df2. For each "hit", where a row in ColumnA of df2 has the same value as a row in Column1 of df1, I want to append a column to df1 holding the value that ColumnB of df2 has in the row where the "hit" was found, so that my result looks like this:
df1:
   Column1 Column2 Column3 Column4 Column5 Column6
0        1       a       r       a       d       e
1        2       b       u       r       w
2        3       c       k       y       h
3        4       d       j
4        5       e       f
What I have tried so far was:
for row in df1, df2:
    if df1[Column1] == df2[ColumnA]:
        print 'yey!'
This gave me an error saying I could not compare two dataframes of different length. So I tried:
for row in df1, df2:
    if df2[df2['ColumnA'].isin(df1['Column1'])]:
        print 'lalala'
    else:
        print 'Nope'
This "works" in the sense that I get an output, but I do not think it iterates over the rows and compares them, since it only prints 'lalala' two times. So I researched some more and found a way to iterate over each row of the dataframe, which is:
for index, row in df1.iterrows():
    print(row['Column1'])
But I do not know how to use this to compare the columns of the two dataframes and get the output I desire.
Any help on how to do this would be really appreciated.
Solution
I recommend using the pandas DataFrame API, which lets you work with dataframes through operations such as join, merge, and groupby. You can find my solution below:
import pandas as pd

df1 = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                    'Column2': ['a', 'b', 'c', 'd', 'e'],
                    'Column3': ['r', 'u', 'k', 'j', 'f']})
# df2's key column is named 'Column1' here so both frames can be merged on it
df2 = pd.DataFrame({'Column1': [1, 1, 1, 2, 2, 3, 3],
                    'ColumnB': ['a', 'd', 'e', 'r', 'w', 'y', 'h']})

buffers = []
for name, group in df2.groupby('Column1'):
    # One-row frame holding the key, then one new column per matching value
    buffer_df = pd.DataFrame({'Column1': group['Column1'][:1]})
    for i, value in enumerate(group['ColumnB']):
        buffer_df['Column_' + str(i)] = value
    buffers.append(buffer_df)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead
dfs = pd.concat(buffers)

result = pd.merge(df1, dfs, how='left', on='Column1')
print(result)
The result is:
   Column1 Column2 Column3 Column_0 Column_1 Column_2
0        1       a       r        a        d        e
1        2       b       u        r        w      NaN
2        3       c       k        y        h      NaN
3        4       d       j      NaN      NaN      NaN
4        5       e       f      NaN      NaN      NaN
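The same reshaping can probably also be done without explicit Python loops. The following is only a sketch of one alternative, not the method used above: it numbers the hits within each key with groupby().cumcount(), pivots those numbers into columns, and then left-joins onto df1. The names pos and wide are illustrative, and df2's key column is assumed to be called 'Column1' as in the solution.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                    'Column2': ['a', 'b', 'c', 'd', 'e'],
                    'Column3': ['r', 'u', 'k', 'j', 'f']})
df2 = pd.DataFrame({'Column1': [1, 1, 1, 2, 2, 3, 3],
                    'ColumnB': ['a', 'd', 'e', 'r', 'w', 'y', 'h']})

# Position of each row within its key group: 0, 1, 2, ...
pos = df2.groupby('Column1').cumcount()

# One row per key, one column per position; missing positions become NaN
wide = (df2.assign(pos=pos)
           .pivot(index='Column1', columns='pos', values='ColumnB')
           .add_prefix('Column_')
           .reset_index())
wide.columns.name = None  # drop the leftover 'pos' label on the column axis

# Left join keeps every row of df1, including keys that have no hit
result = df1.merge(wide, how='left', on='Column1')
print(result)
```

This avoids growing a dataframe inside a loop, which may matter when df2 has many distinct keys.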
P.S. More details:
1) For df2, I produce groups by 'Column1'. Each group is a dataframe. Example below:
   Column1 ColumnB
0        1       a
1        1       d
2        1       e
2) For each group, I produce the dataframe buffer_df:
   Column1 Column_0 Column_1 Column_2
0        1        a        d        e
3) After that, I combine the buffer_df frames into the dataframe dfs:
   Column1 Column_0 Column_1 Column_2
0        1        a        d        e
3        2        r        w      NaN
5        3        y        h      NaN
4) In the end, I perform a left join of df1 with dfs to obtain the needed result.
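To check which rows of df1 actually found a "hit" in the left join of step 4, one option is the indicator flag of merge. This is a minimal sketch on cut-down frames that contain only the key column; the names keys, check and hits are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'Column1': [1, 1, 1, 2, 2, 3, 3]})

# Keep distinct keys only, so the left join cannot multiply rows of df1
keys = df2[['Column1']].drop_duplicates()

# indicator=True adds a '_merge' column: 'both' for a hit, 'left_only' otherwise
check = df1.merge(keys, how='left', on='Column1', indicator=True)

hits = check.loc[check['_merge'] == 'both', 'Column1'].tolist()
print(hits)
```

Rows marked 'left_only' are the ones that end up with NaN in every added column.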
2)* buffer_df is built up iteratively (shown here for the group with Column1 == 3):
step0 (buffer_df = pd.DataFrame({'Column1': group['Column1'][:1]})):
   Column1
5        3
step1 (buffer_df['Column_0'] = group['ColumnB'][5]):
   Column1 Column_0
5        3        y
step2 (buffer_df['Column_1'] = group['ColumnB'][6]):
   Column1 Column_0 Column_1
5        3        y        h
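The iterative steps for one group can be reproduced in isolation. This is a small sketch, assuming the group for key 3 is selected with a boolean mask instead of groupby (it contains the same rows, with index labels 5 and 6); .loc is used to make the label-based lookup explicit.

```python
import pandas as pd

df2 = pd.DataFrame({'Column1': [1, 1, 1, 2, 2, 3, 3],
                    'ColumnB': ['a', 'd', 'e', 'r', 'w', 'y', 'h']})

# Same rows that df2.groupby('Column1') would yield for the key 3
group = df2[df2['Column1'] == 3]

# step0: one-row frame that carries the key and the first row's index label (5)
buffer_df = pd.DataFrame({'Column1': group['Column1'][:1]})

# step1 and step2: each scalar is assigned to a new column of that single row
buffer_df['Column_0'] = group['ColumnB'].loc[5]
buffer_df['Column_1'] = group['ColumnB'].loc[6]
print(buffer_df)
```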