Pandas time series: resampling and interpolating together
Question
I have timestamped sensor data. For technical reasons, the readings arrive at approximately one-minute intervals rather than on exact minutes. The data may look like this:
                tstamp  val
0  2016-09-01 00:00:00   57
1  2016-09-01 00:01:00   57
2  2016-09-01 00:02:23   57
3  2016-09-01 00:03:04   57
4  2016-09-01 00:03:58   58
5  2016-09-01 00:05:00   60
Ideally, every reading would fall on an exact minute, but that is not what I get. The only way to preserve the distribution and still have a value at each minute is to interpolate. For example, rows 1 and 2 are 83 seconds apart, and the natural way to get a value at the exact minute between them (00:02:00) is to interpolate between those two rows (here the result is 57, but the values are not constant everywhere).
Right now, my approach is the following:
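import numpy as np
import pandas as pd
import scipy.interpolate

# Midnight of the day being processed
date = pd.to_datetime(df['measurement_tstamp'].iloc[0].date())
# Seconds since midnight for each measurement
ts_d = (df['measurement_tstamp'].dt.hour * 60 * 60
        + df['measurement_tstamp'].dt.minute * 60
        + df['measurement_tstamp'].dt.second)
# Exact-minute grid for the whole day, in seconds
ts_r = np.arange(0, 24*60*60, 60)
data = scipy.interpolate.interp1d(x=ts_d, y=df['speed'].values)(ts_r)
req = pd.Series(data, index=pd.to_timedelta(ts_r, unit='s'))
req.index = date + req.index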
But this feels rather long-winded to me. pandas has excellent methods for resampling, rounding, and so on. I have been reading about them all day, but none of them seems to interpolate quite the way I want. resample works like a groupby and averages the time points that fall into the same bin. fillna does interpolation, but not after resample has already altered the data by averaging.
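The behavior described above can be reproduced on the sample rows. The following is only an illustrative sketch: the Series `s` and the names `binned` and `filled` are made up for this example. It shows that resampling to one-minute bins averages the two readings that fall in the 00:03 bin and leaves the 00:04 bin empty, so a later interpolation works from averaged values rather than from the original readings.

```python
import pandas as pd

# Sample rows from the question, as a Series indexed by timestamp.
s = pd.Series(
    [57, 57, 57, 57, 58, 60],
    index=pd.to_datetime([
        "2016-09-01 00:00:00",
        "2016-09-01 00:01:00",
        "2016-09-01 00:02:23",
        "2016-09-01 00:03:04",
        "2016-09-01 00:03:58",
        "2016-09-01 00:05:00",
    ]),
)

# One-minute bins: 00:03:04 and 00:03:58 share a bin and are averaged to 57.5;
# no reading falls in the 00:04 bin, so it becomes NaN.
binned = s.resample("min").mean()

# Interpolating afterwards fills 00:04 from the averaged neighbours
# (57.5 and 60), not from the original readings at 00:03:58 and 00:05:00.
filled = binned.interpolate()

print(binned)
print(filled)
```

Here the 00:04 value comes out as 58.75, whereas interpolating in time between the original readings at 00:03:58 (58) and 00:05:00 (60) would give roughly 58.06.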
Am I missing something, or is my approach the best there is?
For simplicity, assume that I group the data by day and by sensor, so only a 24-hour period from one sensor is interpolated at a time.
Solution
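import pandas as pd

d = df.set_index('tstamp')
t = d.index
# Exact-minute grid for one full day, starting at midnight of the first timestamp
r = pd.date_range(t.min().date(), periods=24*60, freq='min')
# Add the grid points to the index, interpolate by index value, keep only the grid
d.reindex(t.union(r)).interpolate(method='index').loc[r]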
Note that periods=24*60 fits a full day of data, not the sample provided in the question. For that sample, periods=6 will work.
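As a worked check, here is a minimal self-contained sketch that applies the same steps to the six sample rows from the question, using periods=6. The variable name `out` is introduced only for this example, and `normalize()` is used here as an equivalent way to get midnight of the first timestamp.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "tstamp": pd.to_datetime([
        "2016-09-01 00:00:00",
        "2016-09-01 00:01:00",
        "2016-09-01 00:02:23",
        "2016-09-01 00:03:04",
        "2016-09-01 00:03:58",
        "2016-09-01 00:05:00",
    ]),
    "val": [57, 57, 57, 57, 58, 60],
})

d = df.set_index("tstamp")
t = d.index

# Exact-minute grid; periods=6 covers 00:00 through 00:05 for this sample.
r = pd.date_range(t.min().normalize(), periods=6, freq="min")

# Union the original timestamps with the grid, interpolate linearly
# with respect to the index values, then keep only the grid rows.
out = d.reindex(t.union(r)).interpolate(method="index").loc[r]
print(out)
```

The rows at 00:00 through 00:03 stay at 57, 00:05 is 60, and 00:04 is interpolated between the readings at 00:03:58 and 00:05:00, giving 58 + 2 * (2 / 62), about 58.06. Because the original timestamps stay in the index until the final selection, no averaging takes place before the interpolation.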