Python CSV 到 SQLite
我正在转换"一个大(~1.6GB)的 CSV 文件,并将 CSV 的特定字段插入到 SQLite 数据库中.基本上我的代码看起来像:
I am "converting" a large (~1.6GB) CSV file and inserting specific fields of the CSV into a SQLite database. Essentially my code looks like:
import csv, sqlite3
conn = sqlite3.connect( "path/to/file.db" )
conn.text_factory = str #bugger 8-bit bytestrings
cur = conn.cur()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')
reader = csv.reader(open(filecsv.txt, "rb"))
for field1, field2, field3, field4, field5 in reader:
cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))
一切都按我的预期工作,但有例外......它需要大量的时间来处理.我编码不正确吗?有没有更好的方法来实现更高的性能并完成我所需要的(只需将 CSV 的几个字段转换为 SQLite 表)?
Everything works as I expect it to with the exception... IT TAKES AN INCREDIBLE AMOUNT OF TIME TO PROCESS. Am I coding it incorrectly? Is there a better way to achieve a higher performance and accomplish what I'm needing (simply convert a few fields of a CSV into SQLite table)?
**EDIT -- 我尝试按照建议直接将 csv 导入 sqlite,但结果我的文件在字段中包含逗号(例如 我的标题,逗号"
).这会导致导入错误.似乎有太多这样的事件无法手动编辑文件...
**EDIT -- I tried directly importing the csv into sqlite as suggested but it turns out my file has commas in fields (e.g. "My title, comma"
). That's creating errors with the import. It appears there are too many of those occurrences to manually edit the file...
还有其他想法吗??**
any other thoughts??**
推荐答案
Chris 是对的 - 使用事务;将数据分成块然后存储.
Chris is right - use transactions; divide the data into chunks and then store it.
"...除非已经在一个事务中,否则每个 SQL 语句都会为其启动一个新事务.这是非常昂贵的,因为它需要为每个语句重新打开、写入和关闭日志文件.这可以通过用 BEGIN TRANSACTION; 和 END TRANSACTION; 语句包装 SQL 语句序列来避免这种情况.对于不改变数据库的语句也可以获得这种加速." - 来源:http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
"...还有另一个技巧可以用来加速 SQLite:事务.每当你必须做多个数据库写入时,把它们放在一个事务中.而不是每个写入(并锁定)文件并且每次发出写入查询时,写入只会在事务完成时发生一次." - 来源:SQLite 的可扩展性如何?
import csv, sqlite3, time
def chunks(data, rows=10000):
""" Divides the data into 10000 rows each """
for i in xrange(0, len(data), rows):
yield data[i:i+rows]
if __name__ == "__main__":
t = time.time()
conn = sqlite3.connect( "path/to/file.db" )
conn.text_factory = str #bugger 8-bit bytestrings
cur = conn.cur()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')
csvData = csv.reader(open(filecsv.txt, "rb"))
divData = chunks(csvData) # divide into 10000 rows each
for chunk in divData:
cur.execute('BEGIN TRANSACTION')
for field1, field2, field3, field4, field5 in chunk:
cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))
cur.execute('COMMIT')
print "
Time Taken: %.3f sec" % (time.time()-t)
相关文章