Split a Python sequence (time series/array) into overlapping subsequences
Problem description
I need to extract all subsequences of a given window length from a time series/array. For example:
>>> ts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> window = 3
>>> subsequences(ts, window)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])
Naive methods that iterate over the sequence are, of course, expensive. For example:
def subsequences(ts, window):
    res = []
    for i in range(ts.size - window + 1):
        subts = ts[i:i+window]
        subts.reset_index(drop=True, inplace=True)
        subts.name = None
        res.append(subts)
    return pd.DataFrame(res)
I found a better way: copy the sequence, shift each copy by a different offset until the window is covered, and split the shifted copies with reshape. Performance is around 100x better, because the for loop iterates over the window size rather than the sequence size:
def subsequences(ts, window):
    res = []
    for i in range(window):
        subts = ts.shift(-i)[:-(ts.size%window)].reshape((ts.size // window, window))
        res.append(subts)
    return pd.DataFrame(np.concatenate(res, axis=0))
I've seen that pandas includes several rolling functions in the pandas.stats.moment module, and I guess what they do is somewhat similar to the subsequencing problem. Is there anything in that module, or anywhere else in pandas, that would make this more efficient?
Thanks!
UPDATE (SOLUTION):
Based on @elyase's answer, there is a slightly simpler implementation for this specific case. Let me write it down here and explain what it does:
def subsequences(ts, window):
    shape = (ts.size - window + 1, window)
    strides = ts.strides * 2
    return np.lib.stride_tricks.as_strided(ts, shape=shape, strides=strides)
Given the 1-D numpy array, we first compute the shape of the resulting array. We will have one row starting at each position of the array, except for the last few elements: starting there, not enough elements would follow to complete the window.
See, in the first example in this description, how the last number we start at is 7, because starting at 8 we can't create a window of three elements. So, the number of rows is the size minus the window plus one. The number of columns is simply the window.
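As a quick check of the row-count formula, the sketch below computes the shape for the ten-element example. The helper name `window_shape` is made up for illustration and is not part of the original function.

```python
import numpy as np

def window_shape(n, window):
    # One row per valid start position, one column per element of the window.
    return (n - window + 1, window)

ts = np.arange(10)
shape = window_shape(ts.size, 3)
last_start = shape[0] - 1  # last index from which a full window still fits

print(shape)       # (8, 3)
print(last_start)  # 7
```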
Next, the tricky part is specifying how to fill the resulting array, given the shape we just defined.
To do so, we take the first element of the original array as the first element of the result. Then we need to specify two values (passed as a tuple of two integers to the strides parameter). These values specify the steps we need to take in the original (1-D) array to fill the result (2-D) array.
Consider a different example, where we want to implement the np.reshape function, going from a 9-element 1-D array to a 3x3 array. The first element fills the first position, and the one to its right is the next element of the 1-D array, so we move 1 step. Then, the tricky part: to fill the first element of the second row, we need to move 3 steps, from the 0 to the 3. See:
>>> original = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> new = np.array([[0, 1, 2],
...                 [3, 4, 5],
...                 [6, 7, 8]])
So, for reshape, our steps for the two dimensions would be (3, 1): 3 steps to move down to the next row, and 1 step to move right to the next column. For our case, where the windows overlap, it is actually simpler. When we move down to the next row, we start at the next position in the 1-D array (1 step), and when we move right, we again get the next element of the 1-D array (1 step). So, the steps would be (1, 1).
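To make the idea of "steps" concrete, here is a minimal sketch that fills a 2-D array by explicit index arithmetic. The helper `fill_by_steps` is hypothetical and deliberately slow. It takes the steps as (step when moving down one row, step when moving right one column), counted in elements rather than bytes.

```python
import numpy as np

def fill_by_steps(flat, shape, steps):
    # Hypothetical helper: out[r, c] comes from flat[r * row_step + c * col_step].
    rows, cols = shape
    row_step, col_step = steps
    out = np.empty(shape, dtype=flat.dtype)
    for r in range(rows):
        for c in range(cols):
            out[r, c] = flat[r * row_step + c * col_step]
    return out

original = np.arange(9)

# Plain reshape: a new row starts 3 elements later, a new column 1 element later.
reshaped = fill_by_steps(original, (3, 3), (3, 1))

# Overlapping windows of length 3: both moves advance by a single element.
windows = fill_by_steps(original, (7, 3), (1, 1))
```

The first call reproduces `original.reshape(3, 3)`, while the second yields the overlapping rows `[0, 1, 2]`, `[1, 2, 3]`, and so on up to `[6, 7, 8]`.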
There is one last thing to note. The strides argument does not accept the "steps" we used, but the corresponding offsets in bytes. To get them, we can use the strides attribute of numpy arrays. It returns a tuple with the strides (steps in bytes), with one element per dimension. In our case we get a 1-element tuple, and we want it twice, hence the * 2.
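The following sketch shows what the byte strides look like for the example series. The dtype is pinned to `int64` here (an assumption made only so the byte counts are predictable across platforms).

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

ts = np.arange(10, dtype=np.int64)

print(ts.itemsize)     # 8 bytes per int64 element
print(ts.strides)      # (8,)
print(ts.strides * 2)  # (8, 8), plain tuple repetition

# One element step (8 bytes) both when moving down a row and when moving right.
windows = as_strided(ts, shape=(ts.size - 3 + 1, 3), strides=ts.strides * 2)
```

With a different dtype the numbers change (for example, 4 bytes per element for `int32`), which is why the function reads `ts.strides` instead of hard-coding a value.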
The np.lib.stride_tricks.as_strided function performs the filling using the described method without copying the data, which makes it quite efficient.
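One way to see that no data is copied is to check that the result shares memory with the input, as in the sketch below. Because the rows overlap in memory, writing through such a view can have surprising effects, so the sketch passes `writeable=False`. It also compares against `sliding_window_view`, a different helper from the same module that is not used in the original post and is available assuming NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

ts = np.arange(10, dtype=np.int64)

# Read-only strided view over the same buffer as ts.
view = as_strided(ts, shape=(8, 3), strides=ts.strides * 2, writeable=False)
print(np.shares_memory(ts, view))  # True: the view reuses the input's memory

# Alternative (assumes NumPy >= 1.20): computes shape and strides itself.
safe = sliding_window_view(ts, 3)
print(np.array_equal(view, safe))  # True
```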
Finally, note that the function posted here assumes a 1-D input array (which is different from a 2-D array with a single row or column). Check the shape attribute of the input array: you should get something like (N,) and not (N, 1). This method would fail on the latter. Note that the method posted by @elyase handles two-dimensional input arrays (that's why this version is slightly simpler).
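If the data arrives as a column of shape (N, 1), one possible workaround is to flatten it before calling the function. This is a sketch only: the variable names are illustrative, and `ravel()` is just one of several ways to obtain a 1-D array.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def subsequences(ts, window):
    # Same function as above; expects a 1-D array.
    shape = (ts.size - window + 1, window)
    strides = ts.strides * 2
    return as_strided(ts, shape=shape, strides=strides)

column = np.arange(10).reshape(-1, 1)
print(column.shape)  # (10, 1), not suitable as-is

flat = column.ravel()
print(flat.shape)    # (10,)

result = subsequences(flat, 3)
print(result.shape)  # (8, 3)
```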
Solution
This is 34x faster than your fast version on my machine:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
>>> rolling_window(ts.values, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])
Credit goes to Erik Rigtorp.
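Because rolling_window builds its shape and strides from the last axis only, it should also accept inputs with more than one dimension. The sketch below tries it on a made-up 2-D array holding two series of ten values each (the array `two_series` is purely illustrative).

```python
import numpy as np

def rolling_window(a, window):
    # Windows are taken along the last axis; leading axes are kept as they are.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

two_series = np.arange(20).reshape(2, 10)
windows = rolling_window(two_series, 3)

print(windows.shape)  # (2, 8, 3): 8 windows of length 3 for each of the 2 rows
print(windows[1, 0])  # [10 11 12], the first window of the second row
```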