从运维角度测试全局死锁以及带来的问题

2021-04-07 00:00:00 节点 等待 死锁 第二个 全局

从运维角度测试全局死锁以及带来的问题

个节点
[oracle@rac2 ~]$ sqlplus scott/tiger@192.168.15.101:1521/prod

SQL> select userenv('sid') from dual;

USERENV('SID')
--------------
279

SQL> select serial# ,sid from v$session where sid=279;

SERIAL# SID
---------- ----------
64364 279

SQL> update emp set ename='test' where empno=7788;

1 row updated.

第二个节点
[oracle@rac2 ~]$ sqlplus scott/tiger@192.168.15.102:1521/prod

SQL> select userenv('sid') from dual;

USERENV('SID')
--------------
49

SQL> select serial# ,sid from v$session where sid=49;

SERIAL# SID
---------- ----------
24429 49

SQL> update emp set ename='test2' where empno=7369;

1 row updated.


继续在第二个节点
SQL> update emp set ename='test3' where empno=7788;

此时卡住,因为该行被个节点持有排他锁

继续在个节点

SQL>
SQL> update emp set ename='test1' where empno=7369;
update emp set ename='test1' where empno=7369
*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource

此时个节点后执行的语句被打断,解除了死锁

看告警日志,这个日志会在个节点alert中出现 ,此时lmd进程发现了并处理死锁
2021-04-07T09:49:33.906946+08:00
Global Enqueue Services Deadlock detected (DID = 8_0_1). More information in file
/oracle/db/base/diag/rdbms/prod/prod1/trace/prod1_lmd0_15769.trc.

继续看第二个节点,此时锁等待还是继续,这里只是Oracle从全局的角度打破了死锁,但是锁等待还是有,这时事务层面的问题。
SQL> update emp set ename='test3' where empno=7788;

继续个节点,回滚数据
SQL> rollback;

Rollback complete.

看第二个节点,此时事务继续执行。

SQL> update emp set ename='test3' where empno=7788;

1 row updated.

在第二个节点我们也回滚数据,结束这次测试。
SQL> rollback;

Rollback complete.


我们继续看全局死锁的lmd的dump文件,我们继续分析日志
[oracle@rac1 trace]$ cat /oracle/db/base/diag/rdbms/prod/prod1/trace/prod1_lmd0_15769.trc | more
Trace file /oracle/db/base/diag/rdbms/prod/prod1/trace/prod1_lmd0_15769.trc
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Build label: RDBMS_12.2.0.1.0_LINUX.X64_170125
ORACLE_HOME: /oracle/db/base/product/12.2
System name: Linux
Node name: rac1
Release: 4.14.35-1902.3.2.el7uek.x86_64
Version: #2 SMP Tue Jul 30 03:59:02 GMT 2019
Machine: x86_64
Instance name: prod1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 22
Unix process pid: 15769, image: oracle@rac1 (LMD0)

开头部分提供了系统信息,数据库版本,操作系统信息,打断全局死锁的实例,该操作的进程LMD0

user session for deadlock lock 0x7ec2dbd0
sid: 279 ser: 64364 audsid: 3130108 user: 108/SCOTT
flags: (0x41) USR/- flags2: (0x40009) -/-/INC
flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
pid: 55 O/S info: user: grid, term: UNKNOWN, ospid: 4043
image: oracle@rac1
client details:
O/S info: user: oracle, term: pts/0, ospid: 4040
machine: rac1 program: sqlplus@rac1 (TNS V1-V3)
application name: SQL*Plus, hash value=3669949024
current SQL:
update emp set ename=:"SYS_B_0" where empno=:"SYS_B_1"
DUMP LOCAL BLOCKER: initiate state dump for DEADLOCK
possible owner[55.4043] on resource TX-0005000E-00000ED4-00000000-00000002

死锁会话和SQL

user session for deadlock lock 0x7d0ec918
sid: 46 ser: 25 audsid: 3130107 user: 108/SCOTT
flags: (0x41) USR/- flags2: (0x40009) -/-/INC
flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
pid: 38 O/S info: user: grid, term: UNKNOWN, ospid: 3874
image: oracle@rac1
client details:
O/S info: user: oracle, term: pts/3, ospid: 3872
machine: rac1 program: sqlplus@rac1 (TNS V1-V3)
application name: SQL*Plus, hash value=3669949024
current SQL:
update emp set ename=:"SYS_B_0" where empno=:"SYS_B_1"
DUMP LOCAL BLOCKER: initiate state dump for DEADLOCK
possible owner[38.3874] on resource TX-00130004-00001455-00000000-00000004

死锁会话和SQL

下面是全局锁的等待关系图,信息很充足,等待和阻塞关系很明了。
========================================================================
Global Wait-For-Graph(WFG) for GES Deadlock ID=[8_0_1]
------------------------------------------------------------------------
Victim : (instance=1, lock=0x7d0ecb28)
Start (master) Instance : 1
Number of Locks involved : 8
Number of Sessions involved : 4

User session identified by:
{
User Name : oracle
User Machine : rac1
OS Terminal Name : pts/3WN
OS Process ID : 3872
OS Program Name : sqlplus@rac1 (TNS V1-V3)
Application Name : SQL*Plusrac1 (TNS V1-V3)
Action Name : Automatic Report Flushuponment Statistics Flush
Current SQL : update emp set ename=:"SYS_B_0" where empno=:"SYS_B_1"
Session Number : 46
Session Serial Number : 25
Server Process ORAPID : 38
Server Process OSPID : 3874
Instance : 1
}
waiting for Lock 0x7d0ec918 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x130004.0x1455(ext 0x0,0x4)
GES Transaction ID : 26000-0001-00000046
}
which is blocked by Lock 0x7ec0c318 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x130004.0x1455(ext 0x0,0x4)
GES Transaction ID : 3A000-0002-00000091
}
owned by the
User session identified by:
{
User Name : oracle
User Machine : rac2
OS Terminal Name : pts/0WN
OS Process ID : 11818
OS Program Name : sqlplus@rac2 (TNS V1-V3)
Application Name : SQL*Plusrac2 (TNS V1-V3)
Action Name : ASH Progressive-Flusho instancestatistics Flush
Current SQL : update emp set ename=:"SYS_B_0" where empno=:"SYS_B_1"
Session Number : 49
Session Serial Number : 24429
Server Process ORAPID : 58
Server Process OSPID : 11820
Instance : 2
}
waiting for Lock 0x7d0ec918 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x5000e.0xed4(ext 0x0,0x2)
GES Transaction ID : 3A000-0002-00000091
}
which is blocked by Lock 0x7d0ec528 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x5000e.0xed4(ext 0x0,0x2)
GES Transaction ID : 24000-0002-000000A2
}
owned by the
User session identified by:
{
User Name : oracle
User Machine : rac2
OS Terminal Name : pts/1WN
OS Process ID : 14496
OS Program Name : sqlplus@rac2 (TNS V1-V3)
Application Name : SQL*Plusrac2 (TNS V1-V3)
Action Name : Auto-CPUUSAGE Actionhresholdst Statistics Flush
Current SQL : update emp set ename=:"SYS_B_0" where empno=:"SYS_B_1"
Session Number : 46
Session Serial Number : 39025
Server Process ORAPID : 36
Server Process OSPID : 14499
Instance : 2
}
waiting for Lock 0x7d0ec528 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x5000e.0xed4(ext 0x0,0x2)
GES Transaction ID : 24000-0002-000000A2
}
which is blocked by Lock 0x7ec2dbd0 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x5000e.0xed4(ext 0x0,0x2)
GES Transaction ID : 37000-0001-000003E2
}
owned by the
User session identified by:
{
User Name : oracle
User Machine : rac1
OS Terminal Name : pts/0WN
OS Process ID : 4040
OS Program Name : sqlplus@rac1 (TNS V1-V3)
Application Name : SQL*Plusrac1 (TNS V1-V3)
Action Name : KTSJ Slaveblespace Thresholds
Current SQL : update emp set ename=:"SYS_B_0" where empno=:"SYS_B_1"
Session Number : 279
Session Serial Number : 64364
Server Process ORAPID : 55
Server Process OSPID : 4043
Instance : 1
}
waiting for Lock 0x7d0ecb28 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x130004.0x1455(ext 0x0,0x4)
GES Transaction ID : 37000-0001-000003E2
}
which is blocked by Lock 0x7d0ec918 (Transaction):
{
Lock Level : KJUSEREX
Resource Name : TX 0x130004.0x1455(ext 0x0,0x4)
GES Transaction ID : 26000-0001-00000046
}
owned by the first user session of the WFG.
------------------------------------------------------------------------
End of Global WFG for GES Deadlock ID=[8_0_1]
========================================================================

这个部分将全局死锁等待图描述得很清楚。

总结:
全局死锁,属于业务逻辑有问题,从不不同实例,对相同对象得记录做操作,这个问题可以通过指定service让其从一个节点执行,但是这样依然会发生死锁,终还是得从业务逻辑做调整,再者如果该问题业务逻辑无法修改,会生成大量得dump文件,根据情况不通产生得大小也有差异,但是这个问题很可能会撑爆Oracle得软件目录,所以还需要定时删除对应得dump文件,以防止实例挂掉!

相关文章