Pandas: aligning multiple DataFrames with a timestamp index

2022-01-11 00:00:00 python pandas time-series concatenation

Problem description

This has been the bane of my life for the past couple of days. I have numerous pandas DataFrames that contain time series data with irregular frequencies, and I am trying to align them into a single DataFrame.

Below is some code with representative DataFrames df1, df2, and df3 (I actually have n=5, and would appreciate a solution that works for all n>2):

    # df1, df2, df3 are given at the bottom
    import pandas as pd
    import datetime

    # I can align df1 to df2 easily
    df1aligned, df2aligned = df1.align(df2)

    # And then concatenate into a single dataframe
    combined_1_n_2 = pd.concat([df1aligned, df2aligned], axis=1)

    # Since I don't know any better, I then try to align df3 to combined_1_n_2 manually:
    combined_1_n_2.align(df3)

    error: Reindexing only valid with uniquely valued Index objects

I have an idea why I get this error, so I get rid of the duplicate indices in combined_1_n_2 and try again:

    combined_1_n_2 = combined_1_n_2.groupby(combined_1_n_2.index).first()
    combined_1_n_2.align(df3)

    # But still get the same error
    error: Reindexing only valid with uniquely valued Index objects

Why am I getting this error? Even if this worked, it would be completely manual and ugly. How can I align more than two time series and combine them in a single DataFrame?

Data:

    df1 = pd.DataFrame(
        {'price': [62.1250, 62.2500, 62.2375, 61.9250, 61.9125]},
        index=[pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
               for s in ['2008-06-01 06:03:59.614000',
                         '2008-06-01 06:03:59.692000',
                         '2008-06-01 06:15:42.004000',
                         '2008-06-01 06:15:42.083000',
                         '2008-06-01 06:17:01.654000']])

    df2 = pd.DataFrame(
        {'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750]},
        index=[pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
               for s in ['2008-06-01 06:13:34.524000',
                         '2008-06-01 06:13:34.602000',
                         '2008-06-01 06:15:05.399000',
                         '2008-06-01 06:15:05.399000',
                         '2008-06-01 06:15:42.082000']])

    df3 = pd.DataFrame(
        {'price': [67.656, 67.875, 67.8125, 67.75, 67.6875]},
        index=[pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
               for s in ['2008-06-01 06:03:52.281000',
                         '2008-06-01 06:03:52.359000',
                         '2008-06-01 06:13:34.848000',
                         '2008-06-01 06:13:34.926000',
                         '2008-06-01 06:15:05.321000']])

Solution

Your specific error is due to the column names of combined_1_n_2 containing duplicates (both columns are named 'price'), not to the index. That is also why removing the duplicate index values did not help. If you rename the columns so they are unique, the second align will work.
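As a rough sketch of the rename-then-align route: the frames below are small stand-ins with unique timestamps rather than the question's data, the names price_1, price_2, and price_3 are made up for illustration, and restricting align to the rows with axis=0 is my own choice, not something the answer specifies.

```python
import pandas as pd

# Small illustrative frames (not the question's data); every column is 'price'.
df1 = pd.DataFrame({'price': [62.125, 62.25]},
                   index=pd.to_datetime(['2008-06-01 06:03:59.614',
                                         '2008-06-01 06:15:42.004']))
df2 = pd.DataFrame({'price': [241.0625, 241.5]},
                   index=pd.to_datetime(['2008-06-01 06:13:34.524',
                                         '2008-06-01 06:15:42.004']))
df3 = pd.DataFrame({'price': [67.656, 67.875]},
                   index=pd.to_datetime(['2008-06-01 06:03:52.281',
                                         '2008-06-01 06:13:34.524']))

# Give each frame a distinct column name before combining (hypothetical names).
df1 = df1.rename(columns={'price': 'price_1'})
df2 = df2.rename(columns={'price': 'price_2'})
df3 = df3.rename(columns={'price': 'price_3'})

# Align on the row index only, then place the columns side by side.
a1, a2 = df1.align(df2, axis=0)
combined = pd.concat([a1, a2], axis=1)

# The combined frame now has unique column names, so it can be aligned again.
combined_aligned, a3 = combined.align(df3, axis=0)
result = pd.concat([combined_aligned, a3], axis=1)

print(result)
```

Each frame contributes one column, and rows where a frame has no observation are filled with NaN.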

An alternative is to chain join calls, which merge the frames on their indexes, as below. The rsuffix argument renames the overlapping 'price' column of each right-hand frame.

    In [23]: df1.join(df2, how='outer', rsuffix='_1').join(df3, how='outer', rsuffix='_2')
    Out[23]:
                                  price   price_1  price_2
    2008-06-01 06:03:52.281000      NaN       NaN  67.6560
    2008-06-01 06:03:52.359000      NaN       NaN  67.8750
    2008-06-01 06:03:59.614000  62.1250       NaN      NaN
    2008-06-01 06:03:59.692000  62.2500       NaN      NaN
    2008-06-01 06:13:34.524000      NaN  241.0625      NaN
    2008-06-01 06:13:34.602000      NaN  241.5000      NaN
    2008-06-01 06:13:34.848000      NaN       NaN  67.8125
    2008-06-01 06:13:34.926000      NaN       NaN  67.7500
    2008-06-01 06:15:05.321000      NaN       NaN  67.6875
    2008-06-01 06:15:05.399000      NaN  241.3750      NaN
    2008-06-01 06:15:05.399000      NaN  241.2500      NaN
    2008-06-01 06:15:42.004000  62.2375       NaN      NaN
    2008-06-01 06:15:42.082000      NaN  241.3750      NaN
    2008-06-01 06:15:42.083000  61.9250       NaN      NaN
    2008-06-01 06:17:01.654000  61.9125       NaN      NaN
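Since the question asks for something that works for any n>2, one possible generalization is to fold the same outer join over a list of frames. This is only a sketch: join_all and the labels passed to it are hypothetical names introduced here, and the frames are rebuilt from the question's data with pd.to_datetime for brevity.

```python
from functools import reduce

import pandas as pd


def join_all(frames, labels):
    """Outer-join any number of frames on their indexes.

    join_all and labels are hypothetical, introduced for illustration.
    Each frame's columns get a '_<label>' suffix so the names stay unique.
    """
    renamed = [f.add_suffix(f'_{label}') for f, label in zip(frames, labels)]
    return reduce(lambda left, right: left.join(right, how='outer'), renamed)


# The question's data, with the index built via pd.to_datetime.
df1 = pd.DataFrame(
    {'price': [62.1250, 62.2500, 62.2375, 61.9250, 61.9125]},
    index=pd.to_datetime(['2008-06-01 06:03:59.614', '2008-06-01 06:03:59.692',
                          '2008-06-01 06:15:42.004', '2008-06-01 06:15:42.083',
                          '2008-06-01 06:17:01.654']))
df2 = pd.DataFrame(
    {'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750]},
    index=pd.to_datetime(['2008-06-01 06:13:34.524', '2008-06-01 06:13:34.602',
                          '2008-06-01 06:15:05.399', '2008-06-01 06:15:05.399',
                          '2008-06-01 06:15:42.082']))
df3 = pd.DataFrame(
    {'price': [67.656, 67.875, 67.8125, 67.75, 67.6875]},
    index=pd.to_datetime(['2008-06-01 06:03:52.281', '2008-06-01 06:03:52.359',
                          '2008-06-01 06:13:34.848', '2008-06-01 06:13:34.926',
                          '2008-06-01 06:15:05.321']))

combined = join_all([df1, df2, df3], ['df1', 'df2', 'df3'])
print(combined.shape)
print(combined.columns.tolist())
```

Suffixing every frame up front avoids tracking a different rsuffix for each step of the chain, and the repeated timestamp in df2 is kept as two rows, as in the output above.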
