Inserting rows into a pandas DataFrame while maintaining column data types

2022-01-22 00:00:00 python pandas dataframe append

Question

What's the best way to insert new rows into an existing pandas DataFrame while maintaining column data types and, at the same time, supplying user-defined fill values for columns that aren't specified? Here's an example:

import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False]
})

Assume that I want to add a new record, passing just name and age. To maintain data types, I can copy a row from df, modify its values, and then append df to the copy, e.g.

columns = ('name', 'age')
copy_df = df.loc[0:0, columns].copy()
copy_df.loc[0, columns] = 'Cindy', 42
new_df = copy_df.append(df, sort=False).reset_index(drop=True)

But that converts the bool column to object.

Here's a really hacky solution that doesn't feel like the "right way" to do this:

columns = ('name', 'age')
copy_df = df.loc[0:0].copy()

missing_remap = {
    'int64': 0,
    'float64': 0.0,
    'bool': False,
    'object': ''
}
for c in set(copy_df.columns).difference(columns):
    copy_df.loc[:, c] = missing_remap[str(copy_df[c].dtype)]

new_df = copy_df.append(df, sort=False).reset_index(drop=True)
new_df.loc[0, columns] = 'Cindy', 42

I know I must be missing something.


Answer

As you found, since NaN is a float, adding NaN to a series may cause it to be either upcast to float or converted to object. You are right that this is not a desirable outcome.
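
As a minimal illustration of that upcasting (the variable names here are made up for the example, and reindex is just one convenient way to introduce a missing value):

```python
import pandas as pd

ints = pd.Series([45, 40, 10])            # dtype: int64
bools = pd.Series([True, True, False])    # dtype: bool

# Reindexing to a label that does not exist (3) introduces a NaN.
ints_with_gap = ints.reindex([0, 1, 2, 3])
bools_with_gap = bools.reindex([0, 1, 2, 3])

print(ints_with_gap.dtype)   # float64: int64 cannot hold NaN
print(bools_with_gap.dtype)  # object: bool cannot hold NaN either
```

This is the same thing that happens to has_children in the question when the new row has no value for it.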

There is no straightforward approach. My suggestion is to store your input row data in a dictionary and combine it with a dictionary of defaults before appending. Note that this works because pd.DataFrame.append accepts a dict argument. (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the code below needs an older pandas version.)

In Python 3.5 and later, you can use the syntax {**d1, **d2} to combine two dictionaries, with values from the second taking precedence.

default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}

row = {'name': 'Cindy', 'age': 42}

df = df.append({**default, **row}, ignore_index=True)

print(df)

   age  has_children   name  weight
0   45          True    Bob   143.2
1   40          True    Sue   130.2
2   10         False    Tom    34.9
3   42         False  Cindy     0.0

print(df.dtypes)

age               int64
has_children       bool
name             object
weight          float64
dtype: object
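
On pandas 2.0 and later, where DataFrame.append no longer exists, the same idea can be sketched with pd.concat. This is an adaptation under that assumption, not part of the original answer; the names new_row and result are illustrative, and the astype call is an extra safeguard that casts the new row to the existing column dtypes before concatenating.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Sue', 'Tom'],
    'age': [45, 40, 10],
    'weight': [143.2, 130.2, 34.9],
    'has_children': [True, True, False],
})

default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}
row = {'name': 'Cindy', 'age': 42}

# Build a one-row frame from the merged dict (values in row override default),
# then cast it to the dtypes of the existing frame.
new_row = pd.DataFrame([{**default, **row}]).astype(df.dtypes.to_dict())

# Every column is present with a matching dtype, so no NaN is introduced
# and concat has no reason to upcast.
result = pd.concat([df, new_row], ignore_index=True)

print(result)
print(result.dtypes)
```

Because the defaults cover every column, the concatenated frame should keep the dtypes of df; unlike the older output above, columns stay in their original order.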
