Finding the DataFrame row closest to a given time in Pandas

2022-01-11 00:00:00 python pandas datetime time-series

Question

I have a Pandas dataframe which is indexed by a DatetimeIndex:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
Data columns:
Date(dd-mm-yy)_Time(hh-mm-ss)       53732  non-null values
Julian_Day                          53732  non-null values
AOT_870                             53732  non-null values
440-870Angstrom                     53732  non-null values
440-675Angstrom                     53732  non-null values
500-870Angstrom                     53732  non-null values
Last_Processing_Date(dd/mm/yyyy)    53732  non-null values
Solar_Zenith_Angle                  53732  non-null values
time                                53732  non-null values
dtypes: datetime64[ns](2), float64(6), object(1)

I want to find the row that is closest to a certain time:

import dateutil.parser
image_time = dateutil.parser.parse('2009-07-28 13:39:02')

and find how close it is. So far, I have tried various things based upon the idea of subtracting the time I want from all of the times and finding the smallest absolute value, but none quite seem to work.

For example:

aeronet.index - image_time

Gives an error which I think is due to +/- on a Datetime index shifting things, so I tried putting the index into another column and then working on that:

aeronet['time'] = aeronet.index
aeronet.time - image_time

This seems to work, but to do what I want, I need to get the ABSOLUTE time difference, not the relative difference. However, just running abs or np.abs on it gives an error:

abs(aeronet.time - image_time)

C:\Python27\lib\site-packages\pandas\core\series.pyc in __repr__(self)
   1061         Yields Bytestring in Py2, Unicode String in py3.
   1062         """
-> 1063         return str(self)
   1064 
   1065     def _tidy_repr(self, max_vals=20):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __str__(self)
   1021         if py3compat.PY3:
   1022             return self.__unicode__()
-> 1023         return self.__bytes__()
   1024 
   1025     def __bytes__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __bytes__(self)
   1031         """
   1032         encoding = com.get_option("display.encoding")
-> 1033         return self.__unicode__().encode(encoding, 'replace')
   1034 
   1035     def __unicode__(self):

C:\Python27\lib\site-packages\pandas\core\series.pyc in __unicode__(self)
   1044                     else get_option("display.max_rows"))
   1045         if len(self.index) > (max_rows or 1000):
-> 1046             result = self._tidy_repr(min(30, max_rows - 4))
   1047         elif len(self.index) > 0:
   1048             result = self._get_repr(print_header=True,

C:\Python27\lib\site-packages\pandas\core\series.pyc in _tidy_repr(self, max_vals)
   1069         """
   1070         num = max_vals // 2
-> 1071         head = self[:num]._get_repr(print_header=True, length=False,
   1072                                     name=False)
   1073         tail = self[-(max_vals - num):]._get_repr(print_header=False,

AttributeError: 'numpy.ndarray' object has no attribute '_get_repr'

Am I approaching this the right way? If so, how should I get abs to work, so that I can then select the minimum absolute time difference, and thus get the closest time. If not, what is the best way to do this with a Pandas time-series?


Answer

I think you can try DatetimeIndex.asof to find the most recent label up to and including the input. Then use the returned datetime to select the appropriate row. If you only need values for a particular column, Series.asof exists and combines the two steps above into one.
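
As a rough sketch of the two asof calls mentioned above, assuming a small made-up frame (df, its three timestamps and the AOT_870 values are hypothetical stand-ins for the aeronet data):

```python
import pandas as pd

# Hypothetical stand-in for the aeronet frame: three timestamped readings.
idx = pd.to_datetime([
    "2009-07-28 13:00:00",
    "2009-07-28 13:30:00",
    "2009-07-28 13:45:00",
])
df = pd.DataFrame({"AOT_870": [0.10, 0.12, 0.15]}, index=idx)

image_time = pd.Timestamp("2009-07-28 13:39:02")

# Most recent index label at or before image_time (the index must be sorted).
label = df.index.asof(image_time)
row = df.loc[label]

# Series.asof does the lookup and the selection in one call.
value = df["AOT_870"].asof(image_time)
```

Note that asof only looks backwards: here it picks 13:30:00 even though 13:45:00 is nearer in absolute terms, so it matches the "closest" row only when the closest row happens to be the preceding one.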

This assumes you want the closest datetime. If you don't care about the date and just want the same time every day, use at_time in DataFrame.
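
For the same-time-every-day case, a minimal at_time sketch could look like this (the frame and its values are again hypothetical):

```python
import pandas as pd

# Hypothetical readings spread over two days.
idx = pd.to_datetime([
    "2009-07-27 13:39:02",
    "2009-07-27 18:00:00",
    "2009-07-28 13:39:02",
])
df = pd.DataFrame({"AOT_870": [0.10, 0.12, 0.15]}, index=idx)

# Keep only the rows whose time of day is exactly 13:39:02, on any date.
same_time_each_day = df.at_time("13:39:02")
```

at_time matches the time of day exactly, so it is not a nearest-time lookup.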

False alarm: I had an older version locally. The latest on master should work with np.abs.

In [10]: np.abs(df.time - image_time)
Out[10]: 
0    27 days, 13:39:02
1    26 days, 13:39:02
2    25 days, 13:39:02
3    24 days, 13:39:02
4    23 days, 13:39:02
5    22 days, 13:39:02
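
Once the absolute differences are available, picking the nearest row and its gap might look like the sketch below. It assumes a reasonably recent pandas and a sorted index; the frame is a hypothetical stand-in for aeronet, and the get_indexer call is an alternative route, not something the original answer used.

```python
import pandas as pd

# Hypothetical stand-in for the aeronet frame.
idx = pd.to_datetime([
    "2009-07-28 13:00:00",
    "2009-07-28 13:30:00",
    "2009-07-28 13:45:00",
])
df = pd.DataFrame({"AOT_870": [0.10, 0.12, 0.15]}, index=idx)

image_time = pd.Timestamp("2009-07-28 13:39:02")

# Absolute time difference between every row and the target time.
deltas = (df.index.to_series() - image_time).abs()

# Position of the smallest absolute difference, then the row and the gap.
pos = deltas.to_numpy().argmin()
nearest_row = df.iloc[pos]
gap = deltas.iloc[pos]

# Alternative: let the index find the nearest label (needs a sorted index).
pos_alt = df.index.get_indexer([image_time], method="nearest")[0]
```

Here nearest_row.name is the timestamp of the closest row and gap is a Timedelta telling how close it is.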

Also, just to clarify:

aeronet.index - image_time doesn't work because subtraction on Index is a set difference (back in the day Index used to be constrained to be unique).
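
For comparison, in more recent pandas releases the minus operator on a DatetimeIndex does element-wise arithmetic, and set difference has moved to Index.difference. A small sketch, assuming a current pandas version and made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps standing in for aeronet.index.
idx = pd.to_datetime([
    "2009-07-28 13:00:00",
    "2009-07-28 13:30:00",
    "2009-07-28 13:45:00",
])
image_time = pd.Timestamp("2009-07-28 13:39:02")

# Element-wise subtraction yields a TimedeltaIndex in recent pandas.
deltas = idx - image_time
smallest_gap = abs(deltas).min()

# Set difference is spelled out explicitly with Index.difference.
remaining = idx.difference(idx[:1])
```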
