Python MongoDB: How to delete duplicate documents in a collection?

2023-04-15 00:00:00 document delete duplicate

To delete duplicate documents from a MongoDB collection, follow these steps:

  1. Connect to the MongoDB database and open the collection:
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
collection = db["mycollection"]
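  2. Build an aggregation query with aggregate(): the "$group" stage groups the documents by field1, "$addToSet" collects the _id of every document in each group, "$sum" counts them, and "$match" keeps only the groups that contain more than one document: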
pipeline = [
    {"$group": {"_id": "$field1", "unique_ids": {"$addToSet": "$_id"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}}
]

duplicates = list(collection.aggregate(pipeline))

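This returns a list with one entry per duplicated value of field1. In each entry, _id is the duplicated value, unique_ids holds the _id of every document that has that value, and count is the number of such documents.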
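To make the shape of that result concrete, here is a minimal sketch that emulates the "$group" and "$match" stages in plain Python, with no database involved. The helper name group_duplicates and the sample documents are made up for illustration; they are not part of pymongo.

```python
from collections import defaultdict


def group_duplicates(docs, field):
    """Emulate the $group + $match stages on a list of dicts (illustration only)."""
    groups = defaultdict(list)
    for doc in docs:
        # Group by the field value; a missing field is grouped under None.
        groups[doc.get(field)].append(doc["_id"])
    # Keep only groups with more than one document, like {"count": {"$gt": 1}}.
    return [
        {"_id": value, "unique_ids": ids, "count": len(ids)}
        for value, ids in groups.items()
        if len(ids) > 1
    ]


# Made-up sample data, not taken from a real collection.
sample = [
    {"_id": 1, "field1": "a"},
    {"_id": 2, "field1": "b"},
    {"_id": 3, "field1": "a"},
]
print(group_duplicates(sample, "field1"))
# prints [{'_id': 'a', 'unique_ids': [1, 3], 'count': 2}]
```

Unlike this sketch, MongoDB does not promise any particular order inside the array built by "$addToSet", so code that consumes the real result should not rely on it.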

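  3. Loop over the list of duplicate groups and delete every document in each group except one. "$addToSet" does not guarantee the order of the collected IDs, so sort them first to make the kept document predictable:
for duplicate in duplicates:
    # Sort so that the document with the smallest _id is always the one kept
    ids = sorted(duplicate["unique_ids"])
    # Delete the remaining documents of the group in a single call
    collection.delete_many({"_id": {"$in": ids[1:]}})

This removes all duplicates from the collection and keeps one document per group: the one with the smallest _id.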
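The query and the deletion loop can also be wrapped in one reusable helper. The following is only a sketch under stated assumptions: the function name remove_duplicates is hypothetical, the collection argument is assumed to be a pymongo Collection, and the _id values within a group are assumed to be mutually comparable (for example all ObjectId or all int) so that sorting works.

```python
def remove_duplicates(collection, field):
    """Keep the smallest-_id document per duplicated value of `field`.

    Hypothetical helper: `collection` is assumed to be a pymongo Collection.
    Returns the number of deleted documents.
    """
    pipeline = [
        # Group by the given field and collect the _id of every document
        {"$group": {"_id": f"${field}",
                    "unique_ids": {"$addToSet": "$_id"},
                    "count": {"$sum": 1}}},
        # Keep only the values that occur more than once
        {"$match": {"count": {"$gt": 1}}},
    ]
    deleted = 0
    # Materialize the result before deleting anything
    for group in list(collection.aggregate(pipeline)):
        ids = sorted(group["unique_ids"])  # $addToSet order is not guaranteed
        result = collection.delete_many({"_id": {"$in": ids[1:]}})
        deleted += result.deleted_count
    return deleted
```

A call such as remove_duplicates(collection, "field1") would then perform the cleanup and report how many documents were removed. Because deletion is irreversible, it may be worth running the pipeline alone first and inspecting the groups it returns.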

Complete code example:

import pymongo

# Connect to the local MongoDB server and open the collection
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
collection = db["mycollection"]

# Group by field1 and keep only the values that occur more than once
pipeline = [
    {"$group": {"_id": "$field1", "unique_ids": {"$addToSet": "$_id"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}}
]

duplicates = list(collection.aggregate(pipeline))

for duplicate in duplicates:
    # $addToSet order is not guaranteed, so sort to keep the smallest _id
    ids = sorted(duplicate["unique_ids"])
    collection.delete_many({"_id": {"$in": ids[1:]}})

For example, suppose we have the following documents, in which field1 contains a duplicate value ("hello" appears in the documents with _id 1 and _id 3):

[
    {"_id": 1, "field1": "hello", "field2": "world"},
    {"_id": 2, "field1": "world", "field2": "hello"},
    {"_id": 3, "field1": "hello", "field2": "foo"},
    {"_id": 4, "field1": "foo", "field2": "bar"},
    {"_id": 5, "field1": "bar", "field2": "baz"},
    {"_id": 6, "field1": "baz", "field2": "pidancode.com"}
]

After running the code above, only the following documents remain:

[
    {"_id": 1, "field1": "hello", "field2": "world"},
    {"_id": 2, "field1": "world", "field2": "hello"},
    {"_id": 4, "field1": "foo", "field2": "bar"},
    {"_id": 5, "field1": "bar", "field2": "baz"},
    {"_id": 6, "field1": "baz", "field2": "pidancode.com"}
]
