Converting irregularly time-stamped measurements into equally spaced, time-weighted averages

2022-01-11 00:00:00 python pandas time-series

Problem description

I have several series of measurements that are time-stamped and irregularly spaced. Values in these series always represent changes of the measurement, i.e. without a change there is no new value. A simple example of such a series would be:

    23:00:00.100  10
    23:00:01.200   8
    23:00:01.600   0
    23:00:06.300   4

What I want to obtain is an equally spaced series of time-weighted averages. For the given example I might aim at a frequency of one second, and hence a result like the following:

    23:00:01  NaN  ( the first 100ms are missing )
    23:00:02  5.2  ( 10*0.2 + 8*0.4 + 0*0.4 )
    23:00:03  0
    23:00:04  0
    23:00:05  0
    23:00:06  2.8  ( 0*0.3 + 4*0.7 )

I am looking for a Python library that solves this problem. It seems like a standard problem to me, but so far I couldn't find such functionality in standard libraries like pandas.

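The algorithm has to take two things into account:

- time-weighted averaging
- when forming the average, considering the value that was set before the current interval started (and possibly even earlier than the immediately preceding interval)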
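The two requirements above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `time_weighted_means` and its parameters are hypothetical, the samples are assumed to be sorted by time, and each bin is labelled by its start. Each value is treated as holding until the next sample arrives, and the value carried into a bin is the last one set at or before the bin start.

```python
from datetime import datetime, timedelta


def time_weighted_means(samples, start, end, step):
    """Average a step function over consecutive bins [t, t + step).

    samples: time-sorted list of (timestamp, value); each value holds until
    the next timestamp. Returns a list of (bin_start, mean). The mean is
    None when the bin starts before the first sample (value unknown).
    """
    results = []
    i = 0  # number of samples known to lie at or before the current bin start
    bin_start = start
    while bin_start < end:
        bin_end = bin_start + step
        # Skip past every sample at or before the bin start.
        while i < len(samples) and samples[i][0] <= bin_start:
            i += 1
        if i == 0:
            results.append((bin_start, None))
        else:
            # Start with the value carried over from before the bin.
            t, value = bin_start, samples[i - 1][1]
            total = 0.0
            j = i
            while j < len(samples) and samples[j][0] < bin_end:
                total += value * (samples[j][0] - t).total_seconds()
                t, value = samples[j]
                j += 1
            total += value * (bin_end - t).total_seconds()
            results.append((bin_start, total / step.total_seconds()))
        bin_start = bin_end
    return results


samples = [
    (datetime(2013, 4, 11, 23, 0, 0, 100000), 10),
    (datetime(2013, 4, 11, 23, 0, 1, 200000), 8),
    (datetime(2013, 4, 11, 23, 0, 1, 600000), 0),
    (datetime(2013, 4, 11, 23, 0, 6, 300000), 4),
]
results = time_weighted_means(
    samples,
    start=datetime(2013, 4, 11, 23, 0, 0),
    end=datetime(2013, 4, 11, 23, 0, 7),
    step=timedelta(seconds=1),
)
for bin_start, mean in results:
    print(bin_start.time(), mean)
```

With the example data, the bin starting at 23:00:00 has no defined mean because its first 100 ms are missing, while the bin starting at 23:00:01 combines the carried-over value 10 with the two changes that fall inside it.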

    data.resample('s').mean().ffill()  # forming a series of seconds

does part of the work. Providing a user-defined aggregation function makes it possible to form time-weighted averages, but because the beginning of each interval is ignored, that average is incorrect too. Even worse: the holes in the series are filled with the average values, so in the example above the values for seconds 3, 4 and 5 come out non-zero.

    data = data.resample('ms').ffill()  # forming a series of milliseconds
    data.resample('s').mean()

does the trick up to a certain accuracy, but, depending on that accuracy, it is very expensive. In my case, too expensive.
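One possible way to avoid the costly millisecond upsampling is to integrate the step function once and read the integral off at the bin edges. The following is a rough NumPy sketch, not a library routine: the function name `binned_time_weighted_mean` is hypothetical, times are assumed to be given as plain seconds (a datetime index would have to be converted first), and the last bin edge is assumed to lie after the last sample.

```python
import numpy as np


def binned_time_weighted_mean(times, values, edges):
    """Time-weighted mean of a step function over the bins given by edges.

    times:  sorted sample times in seconds
    values: value that starts at each sample time and holds until the next
    edges:  sorted bin edges in seconds; edges[-1] must exceed times[-1]
    Bins that start before the first sample come out as NaN.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges, dtype=float)

    # Extend the last value up to the final edge so the integral covers it.
    t = np.append(times, edges[-1])
    # Cumulative integral of the step function at every sample time.
    cum = np.concatenate(([0.0], np.cumsum(values * np.diff(t))))
    # The integral is piecewise linear, so linear interpolation is exact.
    integral_at_edges = np.interp(edges, t, cum)
    means = np.diff(integral_at_edges) / np.diff(edges)
    # The value is unknown before the first sample.
    means[edges[:-1] < times[0]] = np.nan
    return means


# Example from the question, with times as seconds after 23:00:00.
means = binned_time_weighted_mean(
    times=[0.1, 1.2, 1.6, 6.3],
    values=[10, 8, 0, 4],
    edges=np.arange(0, 8),
)
print(means)
```

The cost grows with the number of samples plus the number of bins, independent of the time resolution, which is the main reason to prefer something like this over resampling to milliseconds.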

    import pandas as pa
    import numpy as np
    from datetime import datetime, timedelta

    time_stamps = [datetime(2013, 4, 11, 23, 0, 0, 100000),
                   datetime(2013, 4, 11, 23, 0, 1, 200000),
                   datetime(2013, 4, 11, 23, 0, 1, 600000),
                   datetime(2013, 4, 11, 23, 0, 6, 300000)]
    values = [10, 8, 0, 4]
    raw = pa.Series(index=time_stamps, data=values)

    def round_down_to_second(dt):
        return datetime(year=dt.year, month=dt.month, day=dt.day,
                        hour=dt.hour, minute=dt.minute, second=dt.second)

    def round_up_to_second(dt):
        return round_down_to_second(dt) + timedelta(seconds=1)

    def time_weighted_average(data):
        # Weight each value by the time until the next value (or the end of the second)
        end = pa.DatetimeIndex([round_up_to_second(data.index[-1])])
        return np.average(data, weights=np.diff(data.index.append(end).asi8))

    start = round_down_to_second(time_stamps[0])
    end = round_down_to_second(time_stamps[-1])
    seconds = pa.date_range(start, end, freq='s')

    # Add the full-second boundaries to the index, then carry the last known value forward
    data = raw.reindex(raw.index.union(seconds))
    data = data.ffill()

    data = data.resample('s').apply(time_weighted_average)

Solution

You can do this with traces.

    from datetime import datetime
    import traces

    ts = traces.TimeSeries(data=[
        (datetime(2016, 9, 27, 23, 0, 0, 100000), 10),
        (datetime(2016, 9, 27, 23, 0, 1, 200000), 8),
        (datetime(2016, 9, 27, 23, 0, 1, 600000), 0),
        (datetime(2016, 9, 27, 23, 0, 6, 300000), 4),
    ])

    regularized = ts.moving_average(
        start=datetime(2016, 9, 27, 23, 0, 1),
        sampling_period=1,
        placement='left',
    )

This results in:

    [(datetime(2016, 9, 27, 23, 0, 1), 5.2),
     (datetime(2016, 9, 27, 23, 0, 2), 0.0),
     (datetime(2016, 9, 27, 23, 0, 3), 0.0),
     (datetime(2016, 9, 27, 23, 0, 4), 0.0),
     (datetime(2016, 9, 27, 23, 0, 5), 0.0),
     (datetime(2016, 9, 27, 23, 0, 6), 2.8)]
