pandas - Merging on string columns not working (bug?)

2021-12-27 00:00:00 python pandas merge mysql

I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:

>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')

I try to merge them using this:

>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')

The result of the inner join is empty, which at first suggested to me that the two key columns might have no values in common:

>>> merge_res.shape
(0, 19)

But when I try to match a single element, I see this really odd behavior.

# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'

# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result

# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!

So, the columns are defined with the 'object' dtype. Searching for the value as a string doesn't yield any results. Searching for it as an integer does return a result, and I think this is why the merge above doesn't work.

Any ideas what's going on?

It's almost as though pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)

Solution

The issue was that the object dtype is misleading. I thought it meant that all items were strings, but an object column can hold arbitrary Python objects. Apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.
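
One way to check for this is to count the Python types stored in the column. The snippet below is a minimal sketch on made-up data; the small frame is only a hypothetical stand-in for the real df1, which is not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for df1: one key is a Python int, the other a str.
df1 = pd.DataFrame({"col1": [95498208100000, "A-17"]})

# The column still reports the generic object dtype.
print(df1.col1.dtype)

# Count the Python type of each element (by type name) to expose the mix.
type_counts = df1.col1.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Comparing against the string form misses the int element,
# while comparing against the integer finds it.
as_str_hits = df1[df1.col1 == "95498208100000"]
as_int_hits = df1[df1.col1 == 95498208100000]
print(len(as_str_hits), len(as_int_hits))
```

If more than one type name shows up in the counts, the column is mixed, and equality tests and merges on it will treat 95498208100000 and '95498208100000' as different keys.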

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.
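
The effect of the cast can be sketched on a small example. The frames, the extra columns `a` and `b`, and the key values below are hypothetical; only `col1`, `col2`, and the `pd.merge` call come from the question.

```python
import pandas as pd

# Hypothetical data: df1.col1 mixes an int with a str, df2.col2 is all str.
df1 = pd.DataFrame({
    "col1": pd.Series([95498208100000, "A-17"], dtype=object),
    "a": [1, 2],
})
df2 = pd.DataFrame({
    "col2": pd.Series(["95498208100000", "A-17"], dtype=object),
    "b": [10, 20],
})

# Before the cast, the int key does not match its string counterpart.
before = pd.merge(df1, df2, left_on="col1", right_on="col2")

# Cast both key columns to str, then merge again.
df1["col1"] = df1["col1"].astype(str)
df2["col2"] = df2["col2"].astype(str)
after = pd.merge(df1, df2, left_on="col1", right_on="col2")

print(before.shape, after.shape)
```

In this sketch the first merge loses the row whose key was stored as an int, and the second merge recovers it.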

(I wish there was a way of specifying a dtype of str...)
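
For what it's worth, recent pandas versions ship a dedicated `string` extension dtype (`pd.StringDtype`), which may come close to this. The following is a sketch on a hypothetical mixed Series, assuming a pandas version in which `astype("string")` converts non-string elements to strings.

```python
import pandas as pd

# Hypothetical mixed-type object Series: one int, one str.
s = pd.Series([95498208100000, "A-17"], dtype=object)

# Cast to the dedicated string dtype instead of plain object.
as_string = s.astype("string")

print(as_string.dtype)
print(as_string.tolist())
```

Unlike object, this dtype makes it visible from the dtype alone that the column is meant to hold only strings.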
