Converting a pandas.DataFrame to bytes

Problem description

I need to convert the data stored in a pandas.DataFrame into a byte string, where each column can have its own data type (integer or floating point). Here is a simple set of data:

import numpy as np
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')

df looks like this:

    a            b                  c
0   10  18446744073709551615    1.324000e+10
1   15  230498234019            3.141590e+00
2   20  32094812309             2.341341e+02

The DataFrame knows the type of each column (df.dtypes), so I'd like to do something like this:

data_to_pack = [tuple(record) for _, record in df.iterrows()]
data_array = np.array(data_to_pack, dtype=zip(df.columns, df.dtypes))
data_bytes = data_array.tostring()

This typically works fine, but in this case (because of the maximum uint64 value stored in df['b'][0]) the second line above, which converts the list of tuples to an np.array with the given set of types, raises the following error:

OverflowError: Python int too large to convert to C long

The error originates (I believe) in the first line, which extracts each record as a Series with a single data type (float64 by default); the float64 representation chosen for the maximum uint64 value cannot be converted directly back to uint64.
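The upcasting described above can be checked directly. The following is a minimal sketch, not part of the original question: it rebuilds the same three columns (from a dict, which is an assumption made for brevity) and inspects the dtype of one row produced by iterrows(). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as in the question, built from a dict for brevity.
df = pd.DataFrame({
    "a": np.array([10, 15, 20], dtype="u1"),
    "b": np.array([np.iinfo("u8").max, 230498234019, 32094812309], dtype="u8"),
    "c": np.array([1.324e10, 3.14159, 234.1341], dtype="f8"),
})

# iterrows() yields each row as a Series, which can hold only one dtype,
# so the mixed uint8/uint64/float64 values are upcast to a common type.
_, first_row = next(df.iterrows())
row_dtype = first_row.dtype

# float64 cannot represent 2**64 - 1 exactly; it rounds up to 2**64,
# which no longer fits in an unsigned 64-bit integer.
u8_max = int(np.iinfo("u8").max)
b_as_float = float(u8_max)
exceeds_u8 = int(b_as_float) > u8_max

print(row_dtype, exceeds_u8)
```

If the row dtype prints as float64 and the second value is True, the per-column types are already lost by the time the tuples are built, which is consistent with the overflow reported above.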

1) Since the DataFrame already knows the type of each column, is there a way to avoid building a list of row tuples as input to the typed numpy.array constructor? Or is there a better way than the one outlined above to preserve the type information in such a conversion?

2) Is there a way to go directly from a DataFrame to a byte string that represents the data using the type information for each column?


Solution

You can use df.to_records() to convert your DataFrame to a numpy recarray, which keeps each column's dtype as a field of the record, then call .tostring() (named .tobytes() in current NumPy) to convert it to a string of bytes:

rec = df.to_records(index=False)

print(repr(rec))
# rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
#  (20, 32094812309, 234.1341)],
#           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

# tobytes() is the current name; tostring() is a deprecated alias for it.
s = rec.tobytes()
# np.frombuffer replaces the deprecated binary mode of np.fromstring.
rec2 = np.frombuffer(s, dtype=rec.dtype)

print(np.all(rec2 == rec))
# True
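
For completeness, here is a hedged sketch of the full round trip from a DataFrame to bytes and back to a DataFrame. It is not from the original answer: the use of pd.DataFrame.from_records on the decoded array and the names payload and restored are assumptions made for illustration. Note that the byte string carries no type information of its own, so the record dtype has to be kept (or transmitted) alongside it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.array([10, 15, 20], dtype="u1"),
    "b": np.array([np.iinfo("u8").max, 230498234019, 32094812309], dtype="u8"),
    "c": np.array([1.324e10, 3.14159, 234.1341], dtype="f8"),
})

# Column dtypes become the fields of a record dtype; index=False leaves
# the DataFrame index out of the packed records.
rec = df.to_records(index=False)
layout = rec.dtype

# One record after another, each layout.itemsize bytes long.
payload = rec.tobytes()

# Decoding needs the same record dtype that was used for encoding.
decoded = np.frombuffer(payload, dtype=layout)
restored = pd.DataFrame.from_records(decoded)

print(len(payload), layout.itemsize)
print(restored.dtypes)
```

Because np.frombuffer returns a read-only view over the bytes object, call decoded.copy() first if the decoded array needs to be modified in place.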
