加veritas共享文件系统后实例无法启动

2021-06-24 00:00:00 磁盘 异常 结点 进程 实例

添加veritas共享文件系统后ASM磁盘I/O异常造成实例无法启动,这个 veritas共享文件是为复制软件用的,跟Oracle没有关系,系统组在处理完后,我们发现了问题,下面是处理过程。其实我们定为了IO问题,但是具体哪里问题后还是需要系统和存储工程师去协调处理。下面是处理过程。

一套 三节点集群,使用ASM共享存储,发现2,3结点数据库实例无法启动,1节点实例又频繁重启。下面是分析过程。

CKPT进程写文件异常 LMHB 发现后台核心进程异常terminate实例。
021-06-23T20:04:39.016044+08:00
CKPT (ospid: 20145) waits for event 'control file parallel write' for 94 secs.  <<<<<<CKPT进程写文件异常
2021-06-23T20:04:41.177782+08:00
Errors in file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_lmhb_20127.trc (incident=486956):
ORA-29770: global enqueue process CKPT (OSID 20145) is hung for more than 70 seconds
Incident details in: /oracle/diag/rdbms/xxxxx01/xxxxx012/incident/incdir_486956/xxxxx012_lmhb_20127_i486956.trc
2021-06-23T20:04:43.647557+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_xxxxx01 by memno 1
ERROR: Some process(s) is not making progress.
LMHB (ospid: 20127) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 20127): terminating the instance due to error 29770 <<<<<<LMHB 发现后台核心进程异常terminate实例。
2021-06-23T20:04:44.149112+08:00
System state dump requested by (instance=2, osid=20127 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_diag_20054_20210623200444.trc
2021-06-23T20:04:45.052499+08:00
Dumping diagnostic data in directory=[cdmp_20210623200444], requested by (instance=2, osid=20127 (LMHB)), summary=[abnormal instance termination].
2021-06-23T20:04:45.702580+08:00
License high water mark = 6
2021-06-23T20:04:49.393904+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec
2021-06-23T20:04:54.395504+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 0 sec
2021-06-23T20:04:55.395871+08:00
Instance termination failed to kill one or more processes
2021-06-23T20:04:59.398103+08:00
Instance terminated by LMHB, pid = 20127
2021-06-23T20:04:59.399099+08:00
Warning: 2 processes are still attach to shmid 19:
(size: 32768 bytes, creator pid: 7297, last attach/detach pid: 45950)
2021-06-23T20:04:59.704748+08:00
USER (ospid: 3914): terminating the instance
2021-06-23T20:04:59.705632+08:00
Instance terminated by USER, pid = 3914


结点2 集群尝试启动,如下步骤很慢,数据库日志如下
NOTE: ASMB mounting group 2 (DATA1)
NOTE: ASM background process initiating disk discovery for grp 2 (reqid:0)
2021-06-24T10:35:51.989120+08:00
KSXPPING: KSXP selected for Ping
2021-06-24T10:36:50.257524+08:00
Decreasing number of high priority LMS from 3 to 0
2021-06-24T10:39:01.164090+08:00
NOTE: disk group discovery running slower than expected, ASMB diagnostic requested after 189 seconds
NOTE: ASMB process state dumped to trace file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_ckpt_47894.trc

后续逐渐发现了ASM磁盘,但是后续启动步骤非常慢,ASMB负责与ASM实例通信,在启动时负责初始化磁盘组,找到对应的ASM磁盘,
但是这个过程持续189秒ASMB进程开始dump。 

数据库日志

这里同样是控制文件读写异常,也是i/o问题,超时。由于这个异常实例被关闭。
2021-06-24T10:54:43.712362+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 71 secs.
2021-06-24T10:54:43.712463+08:00
IMR0 (ospid: 1872) is hung in an acceptable location (cfio 0x11.00).
2021-06-24T10:54:46.719241+08:00
IMR has experienced some problems and can't re-enable, retry count 1
2021-06-24T10:54:54.018174+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 91 secs.
2021-06-24T10:56:21.809079+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 169 secs.
2021-06-24T10:56:21.809159+08:00
IMR0 (ospid: 1872) is hung in an acceptable location (cfio 0x11.00).
2021-06-24T10:56:32.416434+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 189 secs.
......

2021-06-24T11:04:34.400380+08:00
Reconfiguration started (old inc 6, new inc 8)
List of instances (total 2) :
2 3
Dead instances (total 1) :
1
My inst 2
.....

2021-06-24T11:06:04.341089+08:00
IMR0 (ospid: 1872) waits for event 'control file parallel write' for 82 secs.
2021-06-24T11:07:20.598935+08:00
requested by (instance=2, osid=47050 (DIA0)), summary=[process state dump request (kjznHA)]
requested by (instance=2, osid=47050 (DIA0)), summary=[process state dump request (kjznHA)]
2021-06-24T11:09:29.246934+08:00
Errors in file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_ora_2453.trc (incident=535284):
ORA-00240: control file enqueue held for more than 120 seconds
Incident details in: /oracle/diag/rdbms/xxxxx01/xxxxx012/incident/incdir_535284/xxxxx012_ora_2453_i535284.trc
2021-06-24T11:11:23.302105+08:00

* instance 3 validates domain 0
2021-06-24T11:11:25.438586+08:00
License high water mark = 4
2021-06-24T11:11:25.439008+08:00
USER (ospid: 33543): terminating the instance
2021-06-24T11:11:29.443778+08:00
......
2021-06-24T11:11:45.451636+08:00
Instance terminated by USER, pid = 33543
2021-06-24T11:11:45.452326+08:00
Warning: 4 processes are still attach to shmid 34:
(size: 32768 bytes, creator pid: 13106, last attach/detach pid: 33543)


处理第二个结点问题时,个结点重起
LGWR 写盘异常,asmb(监控核心后台进程)  结束实例

2021-06-24T11:26:07.035994+08:00
CKPT (ospid: 12600) waits for event 'control file parallel write' for 71 secs.
2021-06-24T11:26:07.036087+08:00
CKPT (ospid: 12600) is hung in an acceptable location (cfio 0x11.00).
2021-06-24T11:26:09.675861+08:00
LGWR (ospid: 12598) waits for event 'LGWR all worker groups' for 23 secs.
2021-06-24T11:26:14.050214+08:00
CKPT (ospid: 12600) waits for event 'control file parallel write' for 80 secs.
2021-06-24T11:26:16.295247+08:00
Errors in file /oracle/diag/rdbms/xxxxx01/xxxxx011/trace/xxxxx011_lmhb_12582.trc (incident=555113):
ORA-29770: global enqueue process CKPT (OSID 12600) is hung for more than 70 seconds
Incident details in: /oracle/diag/rdbms/xxxxx01/xxxxx011/incident/incdir_555113/xxxxx011_lmhb_12582_i555113.trc
2021-06-24T11:26:18.830136+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_xxxxx01 by memno 0
ERROR: Some process(s) is not making progress.
LMHB (ospid: 12582) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 12582): terminating the instance due to error 29770
2021-06-24T11:26:19.197840+08:00
System state dump requested by (instance=1, osid=12582 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/diag/rdbms/xxxxx01/xxxxx011/trace/xxxxx011_diag_12473_20210624112619.trc
。。。。。。
2021-06-24T11:26:44.194255+08:00
Instance terminated by LMHB, pid = 12582


这里可以看出CKPT进程异常,无法写入,LMHB进程监控到这个问题,在dump后随之Instance terminated by LMHB
虽然后续结点1实例启动成功,但是问题依旧。在确认三个结点都可以读取asm磁盘,且权限属组都没有问题后,我们判断是
i/o ,1,3结点可以访问ASM磁盘但是偶尔会慢造成无法访问,而结点2根本无法访问,i/o更慢。 

总结:以上现象都说明存储I/O异常,不稳定需要系统组排查,后经项目负责人协调后确认系统组已经意识到I/O异常,进一步分析处理,
至此该问题分析完毕。

相关文章