Doctrine 2 - How to improve flush efficiency?

2022-01-03 00:00:00 php mysql doctrine-orm

I have to update my Doctrine entities to match the records inside a (potentially very large) XML file. I also have to update the ManyToMany associations according to the data in the XML. This is what I do inside a loop:

  1. Get the data from the XML
  2. Fetch the entity from the database (or create a new one if it does not exist)
  3. Set the new entity properties
  4. Get the entity's current associations (the getter returns an ArrayCollection object)
  5. Clear all associations (by calling ArrayCollection::clear())
  6. Set the new associations (by calling ArrayCollection::add() in a sub-loop)
  7. Persist the entity via the EntityManager

After the loop I call EntityManager::flush().
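
For reference, a minimal sketch of that loop might look like the following; the Product/Category entity names, the repository variables, and the shape of the parsed XML records are all hypothetical:

```php
<?php
// Hypothetical sketch -- entity names, repositories and the parsed-XML shape are assumptions.
foreach ($xmlRecords as $record) {                                  // 1. data from the XML
    $product = $productRepository->findOneBy(['code' => $record['code']]);
    if ($product === null) {
        $product = new Product();                                   // 2. create it if it does not exist
    }

    $product->setName($record['name']);                             // 3. set the new properties

    $categories = $product->getCategories();                        // 4. current associations (ArrayCollection)
    $categories->clear();                                           // 5. clear them all
    foreach ($record['categoryIds'] as $categoryId) {
        $categories->add($categoryRepository->find($categoryId));   // 6. re-add from the XML
    }

    $entityManager->persist($product);                              // 7. persist
}

$entityManager->flush();                                            // one flush after the loop
```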

The problem is that flushing generates a large number of queries instead of updating/inserting/deleting multiple rows at once. For every entity the following queries are executed:

  • SELECT to fetch the entity from the database
  • UPDATE to update the entity properties (actually skipped for now, because no properties have changed yet...)
  • DELETE to clear the previous associations
  • INSERT to insert the new associations

So in total, for 305 records in the XML I get 915 queries (I guess it could go up to 1220 queries if all entities changed), which makes the import very slow.

I could take advantage of the IdentityMap and pre-fetch the entities before the loop, but the UPDATE/DELETE/INSERT queries would still remain.
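
As an illustration, that pre-fetch could look roughly like this (repository and field names are hypothetical); a findBy() with an array value becomes a single SELECT ... WHERE ... IN (...), and the loaded entities end up in the identity map, so the per-record SELECTs disappear:

```php
<?php
// Hypothetical sketch: load every entity referenced by the XML in one query.
$codes = array_column($xmlRecords, 'code');

// Doctrine turns the array value into WHERE code IN (...)
$products = $productRepository->findBy(['code' => $codes]);

$productsByCode = [];
foreach ($products as $product) {
    $productsByCode[$product->getCode()] = $product;
}

// Inside the import loop, look the entity up in the array instead of querying:
// $product = $productsByCode[$record['code']] ?? new Product();
```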

  • Is there a way to make the flush method optimize the queries better (using multi-inserts, WHERE IN instead of multiple DELETE queries, etc.)?
  • Is this normal behaviour for the flush method, or am I doing something wrong?
  • Maybe there is a problem with the way I update the entity associations. Is there a better way to do it (instead of the "get/clear/add" approach)?
  • I know Doctrine is not intended for batch processing, but I think using it for the XML import is the best way to avoid the DB inconsistencies that could appear with a non-ORM approach. Is that right?
  • If the approach above is wrong, how should I solve the problem?

Recommended answer

You're doing it right -- it's just slow, because the added abstraction of the ORM means you can't make the sorts of optimizations you'd like.

That said, the EntityManager does get slow on transactions that large. If you don't absolutely need them all in one big transaction, you can probably get more performant code by flush()ing and then clear()ing the EM every 20-200 iterations of your loop.
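
A minimal sketch of that batching pattern follows; the batch size of 50 and the updateProductFromRecord() helper are made up for illustration. Note that clear() detaches all managed entities, so anything you still need afterwards has to be re-fetched:

```php
<?php
// Illustrative batch size; tune it (the answer suggests somewhere between 20 and 200).
$batchSize = 50;
$i = 0;

foreach ($xmlRecords as $record) {
    $product = updateProductFromRecord($record); // hypothetical helper: steps 1-6 from the loop above
    $entityManager->persist($product);

    if ((++$i % $batchSize) === 0) {
        $entityManager->flush();   // push this batch's INSERT/UPDATE/DELETE statements to the DB
        $entityManager->clear();   // detach everything so the unit of work stays small
    }
}

$entityManager->flush();           // flush the last, possibly partial, batch
$entityManager->clear();
```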

If that doesn't get you enough performance, the only alternative I can think of is to revert to custom code that runs custom SQL directly against your DBMS.
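
If it came to that, one way to sketch it is to drop down to the DBAL connection behind the EntityManager and rebuild each record's join-table rows with one DELETE plus one multi-row INSERT. The table and column names below are made up for illustration, and executeStatement() is the DBAL call for statements that return no rows (older DBAL versions name it executeUpdate()):

```php
<?php
// Hypothetical sketch: raw SQL through DBAL, with made-up table/column names.
$conn = $entityManager->getConnection();

$conn->beginTransaction();
try {
    // Remove the old associations for this record in one statement.
    $conn->executeStatement(
        'DELETE FROM product_category WHERE product_id = ?',
        [$productId]
    );

    // Rebuild them with a single multi-row INSERT.
    if ($categoryIds !== []) {
        $placeholders = implode(', ', array_fill(0, count($categoryIds), '(?, ?)'));
        $params = [];
        foreach ($categoryIds as $categoryId) {
            $params[] = $productId;
            $params[] = $categoryId;
        }
        $conn->executeStatement(
            "INSERT INTO product_category (product_id, category_id) VALUES $placeholders",
            $params
        );
    }

    $conn->commit();
} catch (\Throwable $e) {
    $conn->rollBack();
    throw $e;
}
```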

I know this isn't a great answer, but at least I can tell you that you're not crazy.

------ edit ------

From the official Doctrine2 article on Batch processing:

Some people seem to be wondering why Doctrine does not use multi-inserts (insert into (...) values (...), (...), (...), ...).

First of all, this syntax is only supported on mysql and newer postgresql versions. Secondly, there is no easy way to get hold of all the generated identifiers in such a multi-insert when using AUTO_INCREMENT or SERIAL, and an ORM needs the identifiers for identity management of the objects. Lastly, insert performance is rarely the bottleneck of an ORM. Normal inserts are more than fast enough for most situations, and if you really want to do fast bulk inserts, then a multi-insert is not the best way anyway; Postgres COPY or Mysql LOAD DATA INFILE are several orders of magnitude faster.

These are the reasons why it is not worth the effort to implement an abstraction that performs multi-inserts on mysql and postgresql in an ORM.

There is also a significant difference in performance between a remote and a local database, as the overhead of sending each query to a remote server is quite large. The overhead is much lower with a local database, thanks to transactions and DB optimizations (e.g. 70 sec lowered to 300 ms in the case of the example in the question).
