Pandas: Zigzag segmentation of data based on local minima and maxima
Problem description
I have time series data, generated as follows:
date_rng = pd.date_range('2019-01-01', freq='s', periods=400)
df = pd.DataFrame(np.random.lognormal(.005, .5, size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
s = df['data1']
I want to create a zigzag line connecting the local maxima and local minima such that, on the y-axis, |highest - lowest value| of each zigzag segment exceeds both a percentage (say 20%) of the length of the previous zigzag segment and a pre-stated value k (say 1.2).
I can find the local extrema using this code:
# Find peaks(max).
peak_indexes = signal.argrelextrema(s.values, np.greater)
peak_indexes = peak_indexes[0]
# Find valleys(min).
valley_indexes = signal.argrelextrema(s.values, np.less)
valley_indexes = valley_indexes[0]
# Merge peaks and valleys data points using pandas.
df_peaks = pd.DataFrame({'date': s.index[peak_indexes], 'zigzag_y': s.iloc[peak_indexes]})
df_valleys = pd.DataFrame({'date': s.index[valley_indexes], 'zigzag_y': s.iloc[valley_indexes]})
df_peaks_valleys = pd.concat([df_peaks, df_valleys], axis=0, ignore_index=True, sort=True)
# Sort peak and valley datapoints by date.
df_peaks_valleys = df_peaks_valleys.sort_values(by=['date'])
However, I don't know how to apply the threshold condition to these extrema. Please advise me on how to apply such a condition.
Since the data could contain millions of timestamps, an efficient calculation is highly desirable.
To describe it more clearly, here is example output from my data:
# Instantiate axes.
(fig, ax) = plt.subplots()
# Plot zigzag trendline.
ax.plot(df_peaks_valleys['date'].values, df_peaks_valleys['zigzag_y'].values,
color='red', label="Zigzag")
# Plot original line.
ax.plot(s.index, s, linestyle='dashed', color='black', label="Org. line", linewidth=1)
# Format time.
ax.xaxis_date()
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
plt.gcf().autofmt_xdate() # Beautify the x-labels
plt.autoscale(tight=True)
plt.legend(loc='best')
plt.grid(True, linestyle='dashed')
My desired output (something similar to this, where the zigzag connects only the significant segments):
Solution
I have answered according to my best understanding of the question. However, it is not clear how the variable k influences the filter.
You want to filter the extrema based on a running condition. I assume that you want to mark every extremum whose relative distance to the last marked extremum is larger than p%. I further assume that you always consider the first element of the time series a valid/relevant point.
I implemented this with the following filter function:
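def filter(values, percentage):
    # Work on a plain array so that indexing is positional; the sorted
    # DataFrame column carries a shuffled integer index, so values[0]
    # on the Series would look up label 0 rather than the first row.
    values = np.asarray(values)
    # the first value is always valid
    previous = values[0]
    mask = [True]
    for value in values[1:]:
        relative_difference = np.abs(value - previous)/previous
        if relative_difference > percentage:
            # this extremum becomes the new reference point
            previous = value
            mask.append(True)
        else:
            mask.append(False)
    return mask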
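To see what the running condition does, here is a minimal, self-contained sketch on a handful of made-up values. The name `relative_filter` and the sample numbers are illustrative assumptions, not part of the original code; the logic mirrors the filter above, applied to a plain NumPy array.

```python
import numpy as np

def relative_filter(values, percentage):
    # Hypothetical helper mirroring the running filter, on positional data.
    values = np.asarray(values, dtype=float)
    previous = values[0]          # the first point is always kept
    mask = [True]
    for value in values[1:]:
        if abs(value - previous) / previous > percentage:
            previous = value      # this point becomes the new reference
            mask.append(True)
        else:
            mask.append(False)
    return mask

sample = [1.0, 1.1, 1.5, 1.4, 1.0]   # made-up extrema values
print(relative_filter(sample, 0.2))
# [True, False, True, False, True]
```

The values 1.1 and 1.4 are dropped because each lies within 20% of the last kept value (1.0 and 1.5 respectively).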
To run your code, I first import the dependencies:
from scipy import signal
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
To make the code reproducible, I fix the random seed:
np.random.seed(0)
The rest is copied from your code. Note that I reduced the number of samples to make the result easier to read.
date_rng = pd.date_range('2019-01-01', freq='s', periods=30)
df = pd.DataFrame(np.random.lognormal(.005, .5, size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
s = df['data1']
# Find peaks(max).
peak_indexes = signal.argrelextrema(s.values, np.greater)
peak_indexes = peak_indexes[0]
# Find valleys(min).
valley_indexes = signal.argrelextrema(s.values, np.less)
valley_indexes = valley_indexes[0]
# Merge peaks and valleys data points using pandas.
df_peaks = pd.DataFrame({'date': s.index[peak_indexes], 'zigzag_y': s.iloc[peak_indexes]})
df_valleys = pd.DataFrame({'date': s.index[valley_indexes], 'zigzag_y': s.iloc[valley_indexes]})
df_peaks_valleys = pd.concat([df_peaks, df_valleys], axis=0, ignore_index=True, sort=True)
# Sort peak and valley datapoints by date.
df_peaks_valleys = df_peaks_valleys.sort_values(by=['date'])
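Since the question mentions millions of timestamps, it may help to note that the extrema search itself can be written with plain vectorized NumPy comparisons. The sketch below is an assumption-laden alternative, not the code used in this answer: `local_extrema` is a hypothetical name, and it only detects strict interior peaks and valleys by comparing each point with its two direct neighbours.

```python
import numpy as np

def local_extrema(values):
    # Hypothetical helper: strict local maxima/minima among interior points.
    values = np.asarray(values, dtype=float)
    left = values[1:-1] - values[:-2]    # difference to the left neighbour
    right = values[1:-1] - values[2:]    # difference to the right neighbour
    # +1 converts positions in the interior slice back to full-array positions
    peaks = np.flatnonzero((left > 0) & (right > 0)) + 1
    valleys = np.flatnonzero((left < 0) & (right < 0)) + 1
    return peaks, valleys

demo = np.array([1.0, 3.0, 2.0, 0.5, 1.5, 1.2])  # made-up data
peaks, valleys = local_extrema(demo)
print(peaks, valleys)
```

Merging and sorting the two index arrays (for example with `np.sort(np.concatenate([peaks, valleys]))`) gives the extrema in time order without building intermediate DataFrames.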
Then we use the filter function:
p = 0.2 # 20%
filter_mask = filter(df_peaks_valleys.zigzag_y, p)
filtered = df_peaks_valleys[filter_mask]
Then plot, as you did, both your previous line and the newly filtered extrema:
# Instantiate axes.
(fig, ax) = plt.subplots(figsize=(10,10))
# Plot all extrema.
ax.plot(df_peaks_valleys['date'].values, df_peaks_valleys['zigzag_y'].values,
color='red', label="Extrema")
# Plot zigzag trendline.
ax.plot(filtered['date'].values, filtered['zigzag_y'].values,
color='blue', label="ZigZag")
# Plot original line.
ax.plot(s.index, s, linestyle='dashed', color='black', label="Org. line", linewidth=1)
# Format time.
ax.xaxis_date()
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
plt.gcf().autofmt_xdate() # Beautify the x-labels
plt.autoscale(tight=True)
plt.legend(loc='best')
plt.grid(True, linestyle='dashed')
Edit:
If you want to consider both the first and the last point as valid, you can adapt the filter function as follows:
def filter(values, percentage):
    # Work on a plain array so that indexing is positional rather than
    # label-based (the sorted DataFrame column has a shuffled index).
    values = np.asarray(values)
    # the first value is always valid
    previous = values[0]
    mask = [True]
    # evaluate all points from the second to (n-1)th
    for value in values[1:-1]:
        relative_difference = np.abs(value - previous)/previous
        if relative_difference > percentage:
            previous = value
            mask.append(True)
        else:
            mask.append(False)
    # the last value is always valid
    mask.append(True)
    return mask