Memory-efficient built-in SqlAlchemy iterator/generator?

2021-11-20 python sqlalchemy mysql

I have a ~10M record MySQL table that I interface with using SqlAlchemy. I have found that queries on large subsets of this table will consume too much memory even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:

for thing in session.query(Things):
    analyze(thing)

To avoid this, I find I have to build my own iterator that bites off in chunks:

# Page through the table in descending id order, one chunk at a time.
querySize = 1000  # chunk size; assumed value, tune as needed
lastThingID = None
while True:
    query = session.query(Thing).order_by(Thing.id.desc())
    if lastThingID is not None:
        query = query.filter(Thing.id < lastThingID)
    things = query.limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)

Is this normal or is there something I'm missing regarding SA built-in generators?

The answer to this question seems to indicate that the memory consumption is not to be expected.

Recommended Answer

Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.

But then, the way Query works is that it fully loads the given result set by default before returning your objects. The rationale here concerns queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned; otherwise collections and such might be only partially populated.
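To make that concrete, here is a hypothetical Parent/Child mapping (not from the question) loaded with joinedload(): each Parent row repeats once per matching Child in the raw result set, so the ORM must read all of a Parent's rows before it can hand back that Parent with a complete children collection.

from sqlalchemy.orm import joinedload

# Hypothetical mapping: each Parent appears once per Child in the joined
# rows; the ORM de-duplicates identities and can only finish populating
# parent.children after it has seen every row for that Parent.
for parent in session.query(Parent).options(joinedload(Parent.children)):
    print(len(parent.children))  # fully populated collection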

So Query offers an option to change this behavior through yield_per(). This call will cause the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections, so it's basically for when you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead, so the approach only scales slightly better than not using it.
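As a minimal sketch (reusing the Things mapping from the question; the batch size of 1000 is an arbitrary choice), the yield_per() version of the original loop looks like this:

# Stream results in batches of 1000 rather than buffering the whole
# result set in the ORM; only safe without eager-loaded collections.
for thing in session.query(Things).yield_per(1000):
    analyze(thing)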

I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above using window functions. LIMIT and OFFSET have a huge problem that very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of one, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.

The window function approach is on the wiki and I use it with great success.
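For illustration, here is a minimal sketch of that idea (not the wiki recipe itself), assuming Thing has an integer primary key id and picking an arbitrary window size of 1000:

from sqlalchemy import func

def windowed_things(session, windowsize=1000):
    # Number every row by id, then keep every windowsize-th id
    # (rownum 1, 1001, 2001, ...) as a window boundary.
    subq = session.query(
        Thing.id,
        func.row_number().over(order_by=Thing.id).label("rownum"),
    ).subquery()
    boundaries = [
        id_ for (id_,) in
        session.query(subq.c.id).filter(subq.c.rownum % windowsize == 1)
    ]
    # One small SELECT per window; no OFFSET, so no slowdown as we go.
    for lower, upper in zip(boundaries, boundaries[1:] + [None]):
        q = session.query(Thing).filter(Thing.id >= lower)
        if upper is not None:
            q = q.filter(Thing.id < upper)
        for thing in q:
            yield thing

Iterating the table is then just: for thing in windowed_things(session): analyze(thing)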

Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.
