Fast or Bulk Upsert in pymongo

2022-01-13 python mongodb pymongo nosql upsert

Problem Description

How can I do a bulk upsert in pymongo? I want to update a bunch of entries, and doing them one at a time is very slow.

The answer to an almost identical question is here: Bulk update/upsert in MongoDB?

The accepted answer doesn't actually answer the question. It simply gives a link to the mongo CLI for doing import/exports.

I would also be open to someone explaining why doing a bulk upsert is not possible or not a best practice, but please explain what the preferred solution to this sort of problem is.


Solution

Modern releases of pymongo (greater than 3.x) wrap bulk operations in a consistent interface that downgrades where the server release does not support bulk operations. This is now consistent across MongoDB's officially supported drivers.

So the preferred method for coding is to use bulk_write() instead, where you use UpdateOne or another appropriate operation action. It is now also preferred to use natural lists of operations rather than a specific builder.
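
Both examples below assume an existing collection handle. As a minimal setup sketch (the connection string and the testdb/items names are placeholder assumptions, not part of the original answer):

from pymongo import MongoClient

# Hypothetical connection details -- adjust to your own deployment
client = MongoClient("mongodb://localhost:27017")
collection = client["testdb"]["items"]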

A direct translation of the old documentation:

from pymongo import UpdateOne

# Each UpdateOne carries its own filter, update document, and upsert flag
operations = [
    UpdateOne({"field1": 1}, {"$push": {"vals": 1}}, upsert=True),
    UpdateOne({"field1": 1}, {"$push": {"vals": 2}}, upsert=True),
    UpdateOne({"field1": 1}, {"$push": {"vals": 3}}, upsert=True),
]

result = collection.bulk_write(operations)
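
Assuming the collection starts empty, the first UpdateOne matches nothing, so the upsert creates a single document with field1: 1; the two following operations then match that document and push onto its array, leaving { "field1": 1, "vals": [1, 2, 3] }.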

Or the classic document transformation loop:

import random
from pymongo import UpdateOne

random.seed()

operations = []

for doc in collection.find():
    # Set a random number on every document update
    operations.append(
        UpdateOne({"_id": doc["_id"]}, {"$set": {"random": random.randint(0, 10)}})
    )

    # Send once every 1000 in batch
    if len(operations) == 1000:
        collection.bulk_write(operations, ordered=False)
        operations = []

if len(operations) > 0:
    collection.bulk_write(operations, ordered=False)

The returned result of BulkWriteResult will contain counters of matched and updated documents, as well as the returned _id values for any "upserts" that occur.
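
As a sketch of reading those counters back (using an operations list like the ones above; the attribute names are the standard pymongo BulkWriteResult properties):

result = collection.bulk_write(operations)

print(result.matched_count)   # documents matched by the filters
print(result.modified_count)  # documents actually modified
print(result.upserted_count)  # documents created via upsert
print(result.upserted_ids)    # dict mapping operation index to upserted _id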

There is a bit of a misconception about the size of the bulk operations array. The actual request as sent to the server cannot exceed the 16MB BSON limit, since that limit also applies to the "request" sent to the server, which is in BSON format as well.

However, that does not govern the size of the request array that you can build, as the actual operations will only be sent and processed in batches of 1000 anyway. The only real restriction is that those 1000 operation instructions themselves do not create a BSON document greater than 16MB, which is indeed a pretty tall order.
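
If you want to sanity-check how close a single operation is to that limit, one rough approach is to encode its documents with the bson package that ships with pymongo. Note this measures only the filter and update documents, not the full wire message, so treat it as an approximation:

import bson

# Approximate BSON size of one operation's payload before wrapping it in UpdateOne
filter_doc = {"field1": 1}
update_doc = {"$push": {"vals": 1}}

size = len(bson.encode(filter_doc)) + len(bson.encode(update_doc))
print(f"~{size} bytes of BSON for this operation")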

The general concept of bulk methods is "less traffic": many operations are sent at once and only one server response has to be handled. Removing the overhead attached to every single update request saves a lot of time.
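
A rough comparison sketch of that saving (timings depend on your server and network; the collection handle is the one assumed above):

import time
from pymongo import UpdateOne

ids = list(range(1000))

# One round trip per document
start = time.monotonic()
for i in ids:
    collection.update_one({"_id": i}, {"$set": {"flag": True}}, upsert=True)
print("one at a time:", time.monotonic() - start)

# The same work in a single bulk_write call
start = time.monotonic()
collection.bulk_write(
    [UpdateOne({"_id": i}, {"$set": {"flag": True}}, upsert=True) for i in ids],
    ordered=False,
)
print("bulk_write:", time.monotonic() - start)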
