How to convert a pipe-delimited file to CSV or JSON
Problem description
I have a pipe-delimited txt file of roughly 4 GB. I am trying to import this text into MongoDB, but as you know, MongoDB only supports JSON and CSV files. Below is the code I have so far.
import pandas as pd
import csv
from pymongo import MongoClient
url = "mongodb://localhost:27017"
client = MongoClient(url)
# Creating Database Office
db = client.Office
# Creating Collection Customers
customers = db.Customers
filename = "Names.txt"
data_df = pd.read_fwf(filename, sep="|", engine="python", encoding="latin-1")
fileout = "Names.csv"
output = data_df.to_csv(fileout, sep=",")
print("Finished")
fin = open("Names.csv", "r")
file_data = fin.read()
file_csv = csv.reader(file_data)
Customers.insert_many(file_csv)
The input file "Names.txt" looks like this:
Reg|Name|DOB|Friend|Nationality|Profession^M
1122|Sam|01/01/2001|John|USA|Lawyer^M
2456|George|05/10/1999|Pit|Canada|Engineer^M
5645|Brad|02/06/2000|Adam|UK|Doctor^M
If the supplied text file is already CSV, it should simply be imported into MongoDB; if the txt file is pipe-delimited (or uses any other delimiter), it should be imported only after being converted to a CSV file. The CSV file I get in fileout, when imported into MongoDB manually, comes out looking like this:
col1 col2
id Reg|Name|DOB|Friend|Nationality|Profession
1 1122|Sam|01/01/2001|John|USA|Lawyer
2 2456|George|05/10/1999|Pit|Canada|Engineer
3 5645|Brad|02/06/2000|Adam|UK|Doctor
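The single-column result above happens because pd.read_fwf parses fixed-width fields and ignores sep, so the pipes are never split. A minimal sketch of the same read done with read_csv instead (the in-memory sample stands in for the real Names.txt; for the actual file you would pass the filename plus encoding="latin-1" as in the original code):

```python
import io
import pandas as pd

# In-memory stand-in for the pipe-delimited Names.txt shown above.
sample = (
    "Reg|Name|DOB|Friend|Nationality|Profession\n"
    "1122|Sam|01/01/2001|John|USA|Lawyer\n"
)

# read_csv honours sep="|"; read_fwf does not, which is why the earlier
# attempt produced one wide column instead of six.
df = pd.read_csv(io.StringIO(sample), sep="|")
print(list(df.columns))

# Writing it back out gives a genuinely comma-separated file.
df.to_csv("Names.csv", index=False)
```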
What I want to achieve is shown below. I got this result with the sed command. First I replaced every comma with a hyphen:
sed -i 's/,/-/g' Names.txt
Then I replaced the pipe delimiter with ",":
sed -i 's/|/,/g' Names.txt
col1 col2 col3 col4 col5 col6 col7
id Reg Name DOB Friend Nationality Profession
1 1122 Sam 01/01/2001 John USA Lawyer
2 2456 George 05/10/1999 Pit Canada Engineer
3 5645 Brad 02/06/2000 Adam UK Doctor
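The same two sed substitutions can be reproduced in Python one line at a time, so the 4 GB file never has to be loaded into memory. This is only a sketch (pipe_to_csv_line is a made-up helper name, not from the original post):

```python
def pipe_to_csv_line(line):
    # Step 1, like sed 's/,/-/g': protect commas already in the data.
    # Step 2, like sed 's/|/,/g': turn the pipe delimiters into commas.
    return line.replace(",", "-").replace("|", ",")

# Streaming keeps memory use flat no matter how large the file is:
# with open("Names.txt", encoding="latin-1") as src, \
#      open("Names.csv", "w", encoding="latin-1") as dst:
#     for line in src:
#         dst.write(pipe_to_csv_line(line))

print(pipe_to_csv_line("1122|Sam|01/01/2001|John|USA|Lawyer"))
```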
I know the code above does not actually accomplish anything, but I cannot figure out how to make it work.
I am new to programming of every kind, and I have searched this site for answers to this and various related questions, but none of them fit my needs.
Update
import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer

jsonArray = []
with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
jsonfile = json.loads(jsonString)
customer.insert_many(jsonfile)
This is the new code I came up with after getting some ideas from the comments. The only problem now is that I get this error:
Traceback (most recent call last):
  File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
  File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
MemoryError
Solution
I finally found a solution.
I tested it on a 5 GB file and, although slow, it works: it imports all of the data from the pipe-delimited txt file into MongoDB.
import csv
import json
from pymongo import MongoClient

url_mongo = "mongodb://localhost:27017"
client = MongoClient(url_mongo)
db = client.Office
customer = db.Customer

jsonArray = []
file_txt = "Text.txt"
rowcount = 0
with open(file_txt, "r") as txt_file:
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        rowcount += 1
        jsonArray.append(row)
for i in range(rowcount):
    jsonString = json.dumps(jsonArray[i], indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_one(jsonfile)
print("Finished")
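The json.dumps/json.loads round trip in the solution above is not strictly needed: csv.DictReader already yields plain dicts, which pymongo accepts as documents, and rows can be inserted in batches as they are read instead of first being accumulated in jsonArray. A hedged sketch of that variant (batches is a hypothetical helper, the batch size of 10,000 is arbitrary, and the MongoDB part is left commented out because it requires a running server):

```python
import csv
import itertools

def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable, keeping memory flat."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

# from pymongo import MongoClient
# customer = MongoClient("mongodb://localhost:27017").Office.Customer
# with open("Text.txt", "r") as txt_file:
#     reader = csv.DictReader(txt_file, delimiter="|", quoting=csv.QUOTE_NONE)
#     for chunk in batches(reader, 10_000):
#         customer.insert_many(chunk)  # one server round trip per 10,000 rows
```

insert_many amortizes the network round trip over each batch, which should be noticeably faster than calling insert_one once per row.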
Thanks to everyone for the ideas.