Why do NumPy correlate and corrcoef return different values, and how do I "normalize" a correlation in "full" mode?
Question
I'm trying to do some time series analysis in Python, using NumPy.
I have two medium-sized series, with 20k values each, and I want to check the sliding correlation between them.
corrcoef returns a matrix of auto-correlation/correlation coefficients. That is not useful by itself in my case, because one of the series contains a lag.
The correlate function (with mode="full") returns a 40k-element array that does look like the kind of result I'm aiming for (the peak is as far from the center of the array as the lag would indicate), but the values are all strange: up to 500, when I was expecting something between -1 and 1.
I can't just divide everything by the maximum value; I know the maximum correlation isn't 1.
How can I normalize the "cross-correlation" (correlation in "full" mode) so that the returned values are the correlation at each lag step instead of those very large, strange values?
Answer
You are looking for normalized cross-correlation. This option isn't available yet in NumPy, but a patch that does just what you want is waiting for review. I would think it shouldn't be too hard to apply. Most of the patch is docstring changes. The only lines of code it adds are
    if normalize:
        a = (a - mean(a)) / (std(a) * len(a))
        v = (v - mean(v)) / std(v)
where a and v are the input NumPy arrays whose cross-correlation you are computing. It shouldn't be hard either to add these lines to your own copy of NumPy or to make a copy of the correlate function and add them there. Personally, I would do the latter if I chose to go this route.
Another, quite possibly better, alternative is to normalize the input vectors yourself before you pass them to correlate. It's up to you which way you would like to do it.
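As a rough sketch of that second option, the normalization can be applied to the inputs before calling np.correlate. The helper name normalized_xcorr and the synthetic test data are made up for illustration; only the two normalization lines come from the patch quoted above. Note that this scales by the global mean and standard deviation, so values at large lags (where the overlap is short) are not exact per-lag Pearson coefficients, although they do stay within [-1, 1].

```python
import numpy as np

def normalized_xcorr(a, v, mode="full"):
    """Hypothetical helper: normalize both inputs, then cross-correlate."""
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    # Same normalization as the patch lines quoted above.
    a = (a - a.mean()) / (a.std() * len(a))
    v = (v - v.mean()) / v.std()
    return np.correlate(a, v, mode=mode)

# Synthetic data (an assumption for the demo): y is x delayed by 25 samples plus noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
lag = 25
y = np.roll(x, lag) + 0.1 * rng.standard_normal(x.size)

cc = normalized_xcorr(y, x)

# For equal-length inputs in "full" mode, index i corresponds to lag i - (N - 1).
lags = np.arange(-(x.size - 1), x.size)
best_lag = lags[np.argmax(cc)]
print(best_lag, cc.max())
```

The offset of the peak from the center of the output recovers the lag, and the peak value is now on the correlation scale rather than in the hundreds.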
By the way, this does appear to be the correct normalization according to the Wikipedia page on cross-correlation, except that it divides by len(a) rather than (len(a) - 1). I feel the discrepancy is akin to the population standard deviation vs. the sample standard deviation, and in my opinion it really won't make much of a difference.
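A quick way to check the len(a) vs. (len(a) - 1) question is to compare the zero-lag value with np.corrcoef. The data below is made up for the check. Since np.std defaults to the population standard deviation (ddof=0), dividing by len(a) is the consistent pairing; using std(ddof=1) together with (len(a) - 1) gives the same number. Only mixing the two conventions introduces a factor of n / (n - 1), which is negligible for 20k samples.

```python
import numpy as np

# Two correlated series, invented for this check.
rng = np.random.default_rng(1)
a = rng.standard_normal(500)
v = 0.6 * a + 0.8 * rng.standard_normal(500)
n = len(a)

# Population convention: np.std default (ddof=0), divide by n.
a_n = (a - a.mean()) / (a.std() * n)
v_n = (v - v.mean()) / v.std()
# For equal-length inputs, mode="valid" returns only the zero-lag term.
zero_lag = np.correlate(a_n, v_n, mode="valid")[0]

# Sample convention: ddof=1, divide by n - 1.
a_s = (a - a.mean()) / (a.std(ddof=1) * (n - 1))
v_s = (v - v.mean()) / v.std(ddof=1)
zero_lag_sample = np.correlate(a_s, v_s, mode="valid")[0]

# Off-diagonal entry of the corrcoef matrix is the Pearson coefficient.
pearson = np.corrcoef(a, v)[0, 1]
print(zero_lag, zero_lag_sample, pearson)
```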