为什么 SQLAlchemy 插入 sqlite 比直接使用 sqlite3 慢 25 倍?

2021-12-08 00:00:00 python sqlalchemy orm sqlite

为什么这个简单的测试用例使用 SQLAlchemy 插入 100,000 行比直接使用 sqlite3 驱动程序慢 25 倍?我在实际应用中看到过类似的减速.我做错了什么吗?

Why is this simple test case inserting 100,000 rows 25 times slower with SQLAlchemy than it is using the sqlite3 driver directly? I have seen similar slowdowns in real-world applications. Am I doing something wrong?

#!/usr/bin/env python
# Why is SQLAlchemy with SQLite so slow?
# Output from this program:
# SqlAlchemy: Total time for 100000 records 10.74 secs
# sqlite3:    Total time for 100000 records  0.40 secs


import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine 
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    engine  = create_engine(dbname, echo=False)
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
    DBSession.commit()
    print "SqlAlchemy: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print "sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec"

if __name__ == '__main__':
    test_sqlalchemy(100000)
    test_sqlite3(100000)

我尝试了许多变体(参见 http://pastebin.com/zCmzDraU)

I have tried numerous variations (see http://pastebin.com/zCmzDraU )

推荐答案

SQLAlchemy ORM 使用 工作单元 同步数据库更改时的模式.这种模式远远超出了简单的数据插入".它包括使用属性检测系统接收分配给对象的属性,该系统在对象创建时跟踪对象的更改,包括在 身份映射 其作用是对于每一行 SQLAlchemy 必须检索其最后插入的 id"(如果尚未给出),并且还涉及要插入的行被扫描和排序对于需要的依赖项.对象也受到一定程度的簿记以保持所有这些运行,这对于大量的行来说可能会造成花费在大型数据结构上的过多时间,因此最好将这些分块.

The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple "inserts" of data. It includes that attributes which are assigned on objects are received using an attribute instrumentation system which tracks changes on objects as they are made, includes that all rows inserted are tracked in an identity map which has the effect that for each row SQLAlchemy must retrieve its "last inserted id" if not already given, and also involves that rows to be inserted are scanned and sorted for dependencies as needed. Objects are also subject to a fair degree of bookkeeping in order to keep all of this running, which for a very large number of rows at once can create an inordinate amount of time spent with large data structures, hence it's best to chunk these.

基本上,工作单元是一种高度自动化,以便在没有显式持久性代码的情况下将复杂对象图持久化到关系数据库中的任务自动化,而这种自动化是有代价的.

Basically, unit of work is a large degree of automation in order to automate the task of persisting a complex object graph into a relational database with no explicit persistence code, and this automation has a price.

所以 ORM 基本上不适合高性能批量插入.这就是 SQLAlchemy 有两个独立库的全部原因,如果您查看 http://docs.sqlalchemy.org/en/latest/index.html 您会看到索引页有两个不同的部分 - 一个用于 ORM,另一个用于 Core.不了解两者就无法有效地使用 SQLAlchemy.

So ORMs are basically not intended for high-performance bulk inserts. This is the whole reason why SQLAlchemy has two separate libraries, which you'll note if you look at http://docs.sqlalchemy.org/en/latest/index.html you'll see two distinct halves to the index page - one for the ORM and one for the Core. You cannot use SQLAlchemy effectively without understanding both.

对于快速批量插入的用例,SQLAlchemy 提供了核心,这是 ORM 构建在其上的 SQL 生成和执行系统.有效地使用该系统,我们可以生成与原始 SQLite 版本具有竞争力的 INSERT.下面的脚本说明了这一点,以及预分配主键标识符的 ORM 版本,以便 ORM 可以使用 executemany() 插入行.两个 ORM 版本都将刷新分块为 1000 条记录,这对性能有显着影响.

For the use case of fast bulk inserts, SQLAlchemy provides the core, which is the SQL generation and execution system that the ORM builds on top of. Using this system effectively we can produce an INSERT that is competitive with the raw SQLite version. The script below illustrates this, as well as an ORM version that pre-assigns primary key identifiers so that the ORM can use executemany() to insert rows. Both ORM versions chunk the flushes at 1000 records at a time as well which has a significant performance impact.

此处观察到的运行时是:

Runtimes observed here are:

SqlAlchemy ORM: Total time for 100000 records 16.4133379459 secs
SqlAlchemy ORM pk given: Total time for 100000 records 9.77570986748 secs
SqlAlchemy Core: Total time for 100000 records 0.568737983704 secs
sqlite3: Total time for 100000 records 0.595796823502 sec

脚本:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print "SqlAlchemy ORM: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"

def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print "SqlAlchemy ORM pk given: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name":'NAME ' + str(i)} for i in range(n)]
    )
    print "SqlAlchemy Core: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"

def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn

def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print "sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec"

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

另见:http://docs.sqlalchemy.org/en/latest/faq/performance.html

相关文章