Python - 定位最近的时间戳

2022-01-13 00:00:00 python algorithm timestamp search

问题描述

我有一个 Python 日期时间时间戳和一个大字典(索引),其中键是时间戳,值是我感兴趣的其他一些信息.

I have a Python datetime timestamp and a large dict (index) where keys are timestamps and the values are some other information I'm interested in.

我需要尽可能高效地在索引中找到最接近时间戳的日期时间(键).

I need to find the datetime (the key) in index that is closest to timestamp, as efficiently as possible.

目前我正在做类似的事情:

At the moment I'm doing something like:

for timestamp in timestamps:
    closestTimestamp = min(index,key=lambda datetime : abs(timestamp - datetime))

这可行,但耗时太长 - 我的索引字典有数百万个值,我正在搜索数千次.我对数据结构等很灵活 - 时间戳大致是连续的,所以我从第一个时间戳迭代到最后一个时间戳.同样,我加载到字典中的文本文件中的时间戳也是连续的.

which works, but takes too long - my index dict has millions of values, and I'm doing the search thousands of times. I'm flexible with data structures and so on - the timestamps are roughly sequential, so that I'm iterating from the first to the last timestamps. Likewise the timestamps in the text file that I load into the dict are sequential.

任何关于优化的想法都将不胜感激.

Any ideas for optimisation would be greatly appreciated.


解决方案

没有组织字典以进行有效的接近未命中搜索.它们专为精确匹配而设计(使用 哈希表).

Dictionaries aren't organized for efficient near miss searches. They are designed for exact matches (using a hash table).

您最好维护一个单独的、可快速搜索的有序结构.

You may be better-off maintaining a separate, fast-searchable ordered structure.

一个简单的开始方法是使用 bisect 模块对于快速 O(log N) 搜索但较慢 O(n) 插入:

A simple way to start off is to use the bisect module for fast O(log N) searches but slower O(n) insertions:

def nearest(ts):
    # Given a presorted list of timestamps:  s = sorted(index)
    i = bisect_left(s, ts)
    return min(s[max(0, i-1): i+2], key=lambda t: abs(ts - t))

适用于非静态、动态更新的字典的更复杂的方法是使用 blist 它采用树结构进行快速 O(log N) 插入和查找.只有当 dict 会随着时间而改变时,你才需要这个.

A more sophisticated approach suitable for non-static, dynamically updated dicts, would be to use blist which employs a tree structure for fast O(log N) insertions and lookups. You only need this if the dict is going to change over time.

如果您想继续使用基于字典的方法,请考虑将具有附近时间戳的条目聚集在一起的 dict-of-lists:

If you want to stay with a dictionary based approach, consider a dict-of-lists that clusters entries with nearby timestamps:

 def get_closest_stamp(ts):
      'Speed-up timestamp search by looking only at entries in the same hour'
      hour = round_to_nearest_hour(ts)
      cluster = daydict[hour]         # return a list of entries
      return min(cluster, key=lambda t: abs(ts - t))

注意,对于靠近集群边界的准确结果,请在主集群和相邻集群中存储接近边界的时间戳.

Note, for exact results near cluster boundaries, store close-to-the-boundary timestamps in both the primary cluster and the adjacent cluster.

相关文章