选择由 DatetimeIndex 索引的 Pandas DataFrame 的子集和时间戳列表

2022-01-11 00:00:00 python pandas time-series

问题描述

我有一个大熊猫 DataFrame

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3425100 entries, 2011-12-01 00:00:00 to 2011-12-31 23:59:59
Data columns:
sig_qual    3425100  non-null values
heave       3425100  non-null values
north       3425099  non-null values
west        3425097  non-null values
dtypes: float64(4)

我使用 .ix[start_datetime:end_datetime] 选择该 DataFrame 的子集,然后将其传递给 peakdetect 函数,它返回两个单独列表中局部最大值和最小值的索引和值.我提取最大值的索引位置并使用 DataFrame.index 我得到一个熊猫时间戳列表.

I select a subset of that DataFrame using .ix[start_datetime:end_datetime] and I pass this to a peakdetect function which returns the index and value of the local maxima and minima in two seperate lists. I extract the index position of the maxima and using DataFrame.index I get a list of pandas TimeStamps.

然后我尝试通过将时间戳列表传递给 .ix[] 来提取大型 DataFrame 的相关子集,但它似乎总是返回一个空的 DataFrame.我可以遍历 TimeStamps 列表并从 DataFrame 获取相关行,但这是一个漫长的过程,我认为 ix[] 应该接受一个值列表文档?(虽然我看到 Pandas 0.7 的示例使用 numpy.datetime64numpy.ndarray)

I then attempt to extract the relevant subset of the large DataFrame by passing the list of TimeStamps to .ix[] but it always seems to return an empty DataFrame. I can loop over the list of TimeStamps and get the relevant rows from the DataFrame but this is a lengthy process and I thought that ix[] should accept a list of values according to the docs? (Although I see that the example for Pandas 0.7 uses a numpy.ndarray of numpy.datetime64)

更新:下面选择了一小段 8 秒的 DataFrame 子集,# 行显示了一些值:

Update: A small 8 second subset of the DataFrame is selected below, # lines show some of the values:

y = raw_disp['heave'].ix[datetime(2011,12,30,0,0,0):datetime(2011,12,30,0,0,8)]
#csv representation of y time-series 
2011-12-30 00:00:00,-310.0
2011-12-30 00:00:01,-238.0
2011-12-30 00:00:01.500000,-114.0
2011-12-30 00:00:02.500000,60.0
2011-12-30 00:00:03,185.0
2011-12-30 00:00:04,259.0
2011-12-30 00:00:04.500000,231.0
2011-12-30 00:00:05.500000,139.0
2011-12-30 00:00:06.500000,55.0
2011-12-30 00:00:07,-49.0
2011-12-30 00:00:08,-144.0

index = y.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-12-30 00:00:00, ..., 2011-12-30 00:00:08]
Length: 11, Freq: None, Timezone: None

#_max returned from the peakdetect function, one local maxima for this 8 seconds period
_max = [[5, 259.0]]

indexes = [x[0] for x in _max]
#[5]

timestamps = [index[z] for z in indexes]
#[<Timestamp: 2011-12-30 00:00:04>]

print raw_disp.ix[timestamps]
#Empty DataFrame
#Columns: array([sig_qual, heave, north, west, extrema], dtype=object)
#Index: <class 'pandas.tseries.index.DatetimeIndex'>
#Length: 0, Freq: None, Timezone: None

for timestamp in timestamps:
    print raw_disp.ix[timestamp]
#sig_qual      0
#heave       259
#north        27
#west        132
#extrema       0
#Name: 2011-12-30 00:00:04

更新 2:我 创建了一个 gist,这实际上是有效的,因为当从 csv 加载数据时,时间戳的索引列存储为对象的 numpy 数组,这些对象似乎是字符串.与我自己的代码不同,其中索引的类型为 <class 'pandas.tseries.index.DatetimeIndex'> 并且每个元素的类型为 <class 'pandas.lib.Timestamp'>,我认为传递 pandas.lib.Timestamp 列表与传递单个时间戳的工作方式相同,这会被视为错误吗?

Update 2: I created a gist, which actually works because when the data is loaded in from csv the index columns of timestamps are stored as numpy array of objects which appear to be strings. Unlike in my own code where the index is of type <class 'pandas.tseries.index.DatetimeIndex'> and each element is of type <class 'pandas.lib.Timestamp'>, I thought passing a list of pandas.lib.Timestamp would work the same as passing individual timestamps, would this be considered a bug?

如果我创建原始 DataFrame 并将索引作为字符串列表,则使用字符串列表进行查询可以正常工作.但它确实会显着增加 DataFrame 的字节大小.

If I create the original DataFrame with the index as a list of strings, querying with a list of strings works fine. It does increase the byte size of the DataFrame significantly though.

更新 3:该错误似乎只发生在非常大的 DataFrame 上,我在不同大小的 DataFrame 上重新运行了代码(下面的评论中有一些细节),它似乎发生在超过 270 万条记录的 DataFrame 上.使用字符串而不是时间戳可以解决问题,但会增加内存使用量.

Update 3: The error only appears to occur with very large DataFrames, I reran the code on varying sizes of DataFrame ( some detail in a comment below ) and it appears to occur on a DataFrame above 2.7 million records. Using strings as opposed to TimeStamps resolves the issue but increases memory usage.

已修复在最新的 github master (18/09/2012) 中,请参阅页面底部 Wes 的评论.

Fixed In latest github master (18/09/2012), see comment from Wes at bottom of page.


解决方案

df.ix[my_list_of_dates] 应该可以正常工作.

df.ix[my_list_of_dates] should work just fine.

In [193]: df
Out[193]:
            A  B  C  D
2012-08-16  2  1  1  7
2012-08-17  6  4  8  6
2012-08-18  8  3  1  1
2012-08-19  7  2  8  9
2012-08-20  6  7  5  8
2012-08-21  1  3  3  3
2012-08-22  8  2  3  8
2012-08-23  7  1  7  4
2012-08-24  2  6  0  6
2012-08-25  4  6  8  1

In [194]: row_pos = [2, 6, 9]

In [195]: df.ix[row_pos]
Out[195]:
            A  B  C  D
2012-08-18  8  3  1  1
2012-08-22  8  2  3  8
2012-08-25  4  6  8  1

In [196]: dates = [df.index[i] for i in row_pos]

In [197]: df.ix[dates]
Out[197]:
            A  B  C  D
2012-08-18  8  3  1  1
2012-08-22  8  2  3  8
2012-08-25  4  6  8  1

相关文章