Updating a large sqlite database in chunks
I have a sqlite database (approx. 11 GB) that has multiple tables, including the tables distance and vertices. The table distance is pretty large (120 million rows), vertices is smaller (15 000 rows). I want to use sqlite3 in python to update one column of distance with values from another column in vertices. The table vertices has an index on column cat and another index on orig_cat.
What I am doing:
import sqlite3

db_path = 'path/to/db.db'
conn = sqlite3.connect(db_path)
cur = conn.cursor()

# Correlated subquery: for every row in distance, look up the matching
# orig_cat value in vertices via the indexed cat column.
cur.execute('''UPDATE distance SET
    from_orig_v = (SELECT orig_cat FROM vertices WHERE cat=distance.source)''')
However, running that update statement on such a large database causes a memory error: memory usage increases steadily until it crashes. I am looking for advice on how to perform such a large update statement without running out of memory. Maybe by processing the update in chunks (i.e. batches of rows of the distance table) and committing after e.g. 1000 updates to free memory? How would that be done in python/sqlite?
Answer
It should be possible to update chunks with statements like this:
UPDATE distance SET ... WHERE rowid BETWEEN 100000 AND 200000;
You don't need to use multiple transactions; the only thing that actually must be kept in memory is the list of rows to be updated in a single statement. (In theory, when memory runs out, you should get an appropriate error message. In practice, some OSes overcommit memory and don't tell the application about it until it's too late.)
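A minimal sketch of how such a rowid-based loop could be driven from Python, assuming distance is an ordinary rowid table; the CHUNK constant and the 100 000 step size are illustrative choices, not values given in the answer:

import sqlite3

DB_PATH = 'path/to/db.db'   # same database file as in the question
CHUNK = 100_000             # rows per UPDATE statement; tune as needed

conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()

# Determine how far the rowids go so the loop knows when to stop.
max_rowid = cur.execute('SELECT max(rowid) FROM distance').fetchone()[0] or 0

start = 1
while start <= max_rowid:
    end = start + CHUNK - 1
    cur.execute(
        'UPDATE distance SET from_orig_v = '
        '(SELECT orig_cat FROM vertices WHERE cat = distance.source) '
        'WHERE distance.rowid BETWEEN ? AND ?',
        (start, end))
    start = end + 1

# As noted above, multiple transactions are not required,
# so a single commit at the end is enough.
conn.commit()
conn.close()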