如何在 Sqlite 中使用时间序列和快速时间范围查询?

假设我们用 Unix 时间戳列 ts 在 Sqlite 数据库中记录事件:

Let's say we log events in a Sqlite database with Unix timestamp column ts:

CREATE TABLE data(ts INTEGER, text TEXT);   -- more columns in reality

我们希望快速查找日期时间范围,例如:

and that we want fast lookup for datetime ranges, for example:

SELECT text FROM data WHERE ts BETWEEN 1608710000 and 1608718654;

像这样,EXPLAIN QUERY PLAN 给出了不好的 SCAN TABLE data,所以一个明显的解决方案是 create an index with CREATE INDEX dt_idx ON data(ts).

Like this, EXPLAIN QUERY PLAN gives SCAN TABLE data which is bad, so one obvious solution is to create an index with CREATE INDEX dt_idx ON data(ts).

那么问题就解决了,但是必须为我们已经增加的序列/已经排序的列 ts 维护一个索引是一个糟糕的解决方案可以直接在 O(log n) 中使用 B-tree 搜索.在内部这将是索引:

Then the problem is solved, but it's rather a poor solution to have to maintain an index for an already-increasing sequence / already-sorted column ts for which we could use a B-tree search in O(log n) directly. Internally this will be the index:

ts           rowid
1608000001   1
1608000002   2
1608000012   3
1608000077   4

这是对数据库空间的浪费(当查询必须首先查看索引时会浪费 CPU).

which is a waste of DB space (and CPU when a query has to look in the index first).

为了避免这种情况:

  • (1) 我们可以使用 ts 作为 INTEGER PRIMARY KEY,所以 ts 将是 rowid本身.但这失败了,因为 ts 不是唯一的:2 个事件可以在同一秒(甚至在同一毫秒)发生.

  • (1) we could use ts as INTEGER PRIMARY KEY, so ts would be the rowid itself. But this fails because ts is not unique: 2 events can happen at the same second (or even at the same millisecond).

例如查看 SQLite Autoincrement 中给出的信息.

See for example the info given in SQLite Autoincrement.

(2) 我们可以使用 rowid 作为时间戳 ts 与递增的数字连接.示例:

(2) we could use rowid as timestamp ts concatenated with an increasing number. Example:

 16087186540001      
 16087186540002
 [--------][--]
     ts     increasing number 

那么 rowid 是唯一的并且严格递增(假设每秒事件少于 10k),并且不需要索引.查询 WHERE ts BETWEEN a AND b 将简单地变为 WHERE rowid BETWEEN a*10000 AND b*10000+9999.

Then rowid is unique and strictly increasing (provided there are less than 10k events per second), and no index would be required. A query WHERE ts BETWEEN a AND b would simply become WHERE rowid BETWEEN a*10000 AND b*10000+9999.

但是有没有一种简单的方法可以让 Sqlite INSERT rowid 大于或等于给定值的项目?假设当前时间戳为 1608718654 并出现两个事件:

But is there an easy way to ask Sqlite to INSERT an item with a rowid greater than or equal to a given value? Let's say the current timestamp is 1608718654 and two events appear:

  CREATE TABLE data(ts_and_incr INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT);
  INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello")  #16087186540001 
  INSERT INTO data VALUES (NEXT_UNUSED(1608718654), "hello")  #16087186540002

更一般地说,如何使用 Sqlite 以最佳方式创建时间序列,以实现快速查询WHERE timestamp BETWEEN a AND b?

推荐答案

第一个解决方案

问题中详述的方法(2)似乎效果很好.在基准测试中,我获得了:

First solution

The method (2) detailed in the question seems to work well. In a benchmark, I obtained:

  • 简单的方法,没有索引:18 MB 数据库,86 ms 查询时间
  • 简单的方法,索引:32 MB 数据库,12 毫秒查询时间
  • 方法 (2):18 MB 数据库,12 ms 查询时间

这里的关键是使用 dt 作为 INTEGER PRIMARY KEY,所以 它将是行 id 本身(另见 是主键需要的索引吗SQLite?),使用 B-tree,并且将 not 有另一个隐藏的 rowid 列.因此,我们避免了一个额外的索引,它会产生对应的 dt =>rowid:这里dt 是行id.

The key point is here to use dt as an INTEGER PRIMARY KEY, so it will be the row id itself (see also Is an index needed for a primary key in SQLite?), using a B-tree, and there will not be another hidden rowid column. Thus we avoid an extra index which would make a correspondance dt => rowid: here dt is the row id.

我们还使用 AUTOINCREMENT,它在内部创建一个 sqlite_sequence 表,用于跟踪最后添加的 ID.这在插入时很有用:因为两个事件可能具有相同的时间戳(以秒为单位)(即使使用毫秒或微秒的时间戳也是可能的,操作系统可能会截断精度),我们使用 timestamp*10000 之间的最大值last_added_ID + 1 以确保它是唯一的:

We also use AUTOINCREMENT which internally creates a sqlite_sequence table, which keeps track of the last added ID. This is useful when inserting: since it is possible that two events have the same timestamp in seconds (it would be possible even with milliseconds or microseconds timestamps, the OS could truncate the precision), we use the maximum between timestamp*10000 and last_added_ID + 1 to make sure it's unique:

 MAX(?, (SELECT seq FROM sqlite_sequence) + 1)

代码:

import sqlite3, random, time
db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases of 1 second with probability 1%
        t += 1
    db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, (SELECT seq FROM sqlite_sequence) + 1), 'hello');", (t*10000, ))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
    pass
print(time.time()-start)


使用 WITHOUT ROWID

这是另一种带有 WITHOUT ROWID 的方法,它给出了 8 毫秒 查询时间.我们必须自己实现一个自动递增的 id,因为使用 WITHOUT ROWID 时 AUTOINCREMENT 不可用.
当我们想要使用 PRIMARY KEY(dt, another_column1, another_column2, id) 并避免有额外的 rowid 列时,WITHOUT ROWID 很有用.rowid 有一个 B-tree,(dt, another_column1, ...) 有一个 B-tree,我们只需要一个.


Using a WITHOUT ROWID table

Here is another method with WITHOUT ROWID which gives a 8 ms query time. We have to implement an auto-incrementing id ourself, since AUTOINCREMENT is not available when using WITHOUT ROWID.
WITHOUT ROWID is useful when we want to use a PRIMARY KEY(dt, another_column1, another_column2, id) and avoid to have an extra rowid column. Instead of having one B-tree for rowid and one B-tree for (dt, another_column1, ...), we'll have just one.

db.executescript("""
    CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);

    CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
    
    CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
    """)

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0: # timestamp increases of 1 second with probabibly 1%
        t += 1
    db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()

# t will range in a ~ 10 000 seconds window
t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
    pass
print(time.time()-start)


粗略排序的 UUID

更一般地说,该问题与粗略排序"的 ID 有关.按日期时间.更多信息:


Roughly-sorted UUID

More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:

  • ULID(通用唯一字典排序标识符)
  • 雪花
  • MongoDB ObjectId
  • ULID (Universally Unique Lexicographically Sortable Identifier)
  • Snowflake
  • MongoDB ObjectId

所有这些方法都使用一个 ID:

All these methods use an ID which is:

[---- timestamp ----][---- random and/or incremental ----]

相关文章