重新启动死锁事务时导致后续错误的原因是什么?

2022-01-15 00:00:00 mariadb mysql galera

在提交阶段重新启动失败的事务时,我在重新启动事务时遇到第二次失败.这是在 MariaDB 10.2.6 下运行 Galera Cluster.

When restarting a failed transaction at commit stage I get a second failure when restarting the transaction. This is running Galera Cluster under MariaDB 10.2.6.

事件的顺序是这样的:

  1. 提交一个事务(比如一次插入).
  2. COMMIT 失败并出现 错误 1213 尝试获取锁时发现死锁"
  3. 开始一个新事务以重放 SQL 语句.
  4. BEGIN 失败并出现 错误 1047 WSREP 尚未准备好节点供应用程序使用"
  5. 我的应用程序保释以避免更严重的崩溃(请参阅下面的注释)
  1. Commit a transaction (say a single insert).
  2. COMMIT fails with error 1213 "Deadlock found when trying to get lock"
  3. Begin a new transaction to replay the SQL statement[s].
  4. BEGIN fails with error 1047 "WSREP has not yet prepared node for application use"
  5. My application bails to avoid a more serious crash (see notes below)

这种情况经常发生,尽管集群恢复了,但个别线程会收到故障.昨天这种情况在一秒钟内发生了 15 次.

This happens quite regularly and although the cluster recovers, individual threads receive failures. Yesterday this happened 15 times in one second.

我无法确定任何根本原因.看来死锁是问题的始作俑者.这种情况应该是可以恢复的(而且经常是)但是由于多个客户端都试图同时解决他们的死锁,整个事情似乎只是失败了.

I cannot identify any root cause for this. It seems that the deadlock is the initiator of the problem. The situation should be recoverable (and often is) But with multiple clients all trying to resolve their deadlocks at the same time, the whole thing seems to just fail.

注意事项:

这与 较早的问题 重试失败的事务会导致集群完全崩溃.我已经设法通过仅在死锁上重试事务来防止崩溃.即,如果在重新启动期间发生不同类型的错误,应用程序将放弃.

This is related to an earlier question where retrying failed transactions caused total crash of the cluster. I've managed to prevent crashes by retrying transactions only on deadlocks. i.e. if a different type of error occurs during a restart the application gives up.

我知道 10.2.6 不是 MariaDB 的最新版本.我现在很紧张,因为我有过如此糟糕的经历.我想在升级之前了解当前的问题,但我无法在测试环境中重现错误.

I'm aware that 10.2.6 is not the latest version of MariaDB. I'm nervous to upgrade right now as I've had such bad experiences. I would like to understand the current problem before doing an upgrade and I've been unable to reproduce the errors in a test environment.

推荐答案

我不确定,但我怀疑 3 次尝试(不是 2 次)是合适的.提交涉及两个步骤:

I'm not sure, but I suspect 3 tries (not 2) is appropriate. Committing involves two steps:

  • 仅在您连接的节点内检查死锁.(例如:另一个查询触及同一行或间隙.)
  • 与其他节点核对,看看他们是否会抱怨.(例如:同一行已经插入到另一个节点中.)

当然,其中任何一个都可能以任何顺序重复发生.但是尝试 3 次似乎是合理的.

Sure, either of those could happen repeatedly, and in any order. But making 3 tries seems reasonable.

现在,一旦您失败了太多"次,就应该放弃并让人类(DBA 类型)参与进来.我怀疑您可以以某种方式重组您的代码/应用程序逻辑/等,以避免大多数故障.您是否愿意提供更多详细信息,以便我们讨论这种可能性...

Now, once you have failed "too many" times, it is right to abort and get a human (a DBA type) involved. I suspect that you could restructure your code / application logic / etc in some way to avoid most of the failures. Would you like to provide more details, so we can discuss that possibility...

  • 什么样的桌子?(队列、事务、日志等)
  • 显示创建表.(auto_inc、唯一键等;太多的 UNIQUE 键会加重情况)
  • INSERT 是什么样的?
  • 您多久运行一次这样的插入?它多久失败一次?(检测您的代码,以便计算那些可以恢复的代码.)
  • 集群的分布范围如何?(ping 时间)
  • 还有哪些其他查询正在访问该表?(他们可能会加剧问题.)
  • What kind of table? (Queue, transactions, logging, etc)
  • SHOW CREATE TABLE. (auto_inc, unique keys, etc; too many UNIQUE keys can aggravate the situation)
  • What does the INSERT look like?
  • How often do you run inserts like this one? How often does it fail? (Instrument your code so you count even those that you can recover from.)
  • How spread out is the Cluster? (ping time)
  • What other queries are hitting the table? (They may be aggravating the issue.)

相关文章