Oracle RAC on Veritas Storage Fails to Start After an Operating System Kernel Upgrade

2022-03-02 00:00:00 cluster, production, check, process, discovery


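Because of an operating system kernel bug, all systems that had not yet gone live had to be upgraded. One cluster was chosen as a test, and after the operating system upgrade RAC could not start. The analysis is recorded below.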


1 crsctl stat res -t -init shows that the cluster cssd service cannot start
ora.cssd
1 ONLINE OFFLINE STABLE
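
On a stack with many resources, the same check can be scripted. The following is a minimal sketch, assuming the usual layout of crsctl stat res -t output (a resource name line starting with "ora.", followed by instance lines of the form number, target, state). The function name offline_resources is hypothetical and not part of any Oracle tool.

```shell
# List resources whose target is ONLINE but whose current state is not.
# Reads "crsctl stat res -t -init" style output on stdin.
# Assumption: name lines start with "ora." and instance lines start with a number.
offline_resources() {
    awk '
        /^ora\./ { name = $1; next }          # remember the resource name
        $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 != "ONLINE" {
            print name, $3                    # report name and actual state
        }
    '
}
```

Usage would look like: crsctl stat res -t -init | offline_resources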
2 Check the cluster alert log
[CSSDMONITOR(30734)]CRS-8500: Oracle Clusterware CSSDMONITOR process
is starting with operating system process ID 30734
[OCSSD(29988)]CRS-8504: Oracle Clusterware OCSSD process with operating
system process ID 29988 is exiting
[OHASD(28371)]CRS-2878: Failed to restart resource 'ora.cssd' <<<<< restart failed

3 Cluster cssdmonitor log
The cssdmonitor log shows that VMON failed to initialize and therefore could not communicate, which caused clsncssd_reboot to shut down the cssd process.
14:26:30.286 : USRTHRD:3964106496: clsnvmon_main: vmon getting active
14:26:30.286 : USRTHRD:3964106496: clsnvmon_skgxnGroupName: name(OCLSMON_)
14:26:30.286 : USRTHRD:3964106496: clsnvmon_initskgxn: Failure initializing skgxn context. <<<<<< VMON initialization failed
14:26:30.286 : USRTHRD:3964106496: clsncssd_logose: slos [5], SLOS depend-msg [Error 0], SLOS error-msg [0]
14:26:30.286 : USRTHRD:3964106496: clsncssd_logose: SLOS other info is [Initconnection failed].
14:26:30.286 : USRTHRD:3964106496: clsncssd_reboot (VMON): fatal 0 clidead 0 mode 4 dev 0
14:26:30.286 : USRTHRD:3964106496: clsncssd_reboot: sending state change to CSS
14:26:30.286 : USRTHRD:3964106496: clsnpollSendMsg: Trying Sendsync for message -13
14:26:30.287 : USRTHRD:3964106496: clsnpollSendMsg: Sendsync complete for message- 13, with status - gipcretSuccess(0)
14:26:30.287 : USRTHRD:3964106496: clsncssd_reboot: shutdown immediate sent to CSS <<<<<<< clsncssd_reboot shut down the cssd process
14:26:30.387 : USRTHRD:3964106496: clsncssd_reboot: waiting for CSS to be down
According to the Veritas documentation, the VMON process is the interface through which Oracle Clusterware and the Veritas cluster interact.
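
To pick this failure pattern out of a long trace file, a small filter can help. This is only a sketch based on the message texts in the excerpt above. The function name vmon_failure_lines is hypothetical, and the trace file path is left to the caller because it depends on the installation.

```shell
# Print the lines that show the VMON initialization failure and the
# resulting CSS shutdown request. Reads cssdmonitor trace text on stdin.
# Exit status is 0 if at least one such line was found, 1 otherwise.
vmon_failure_lines() {
    grep -E 'clsnvmon_initskgxn: Failure initializing|clsncssd_reboot: shutdown immediate sent to CSS'
}
```

Usage would look like: vmon_failure_lines < path-to-cssdmonitor-trace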
4 File and process checks

The library file is already linked:
#ll libskgxn2*
lrwxrwxrwx 1 grid oinstall 33 Feb 22 14:18 libskgxn2.so -> /etc/ORCLcluster/lib/libskgxn2.so
#ll /etc/ORCLcluster/lib/libskgxn2.so
lrwxrwxrwx 1 root root 22 Feb 14 17:36 /etc/ORCLcluster/lib/libskgxn2.so -> /usr/lib64/libvcsmm.so
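
The two ll commands above follow the link chain by hand. As a rough sketch, the same walk can be done in one step and extended to confirm that the final target actually exists. The function name link_chain is hypothetical, and the 40-hop limit is an arbitrary guard against link loops.

```shell
# Print each hop of a symlink chain, then report whether the final
# target exists. Exit status is non-zero if the chain ends at a missing file.
link_chain() (
    path=$1
    hops=0
    while [ -L "$path" ] && [ "$hops" -lt 40 ]; do
        target=$(readlink "$path")
        case $target in
            /*) ;;                                    # absolute target, use as is
            *)  target=$(dirname "$path")/$target ;;  # relative to the link's directory
        esac
        printf '%s -> %s\n' "$path" "$target"
        path=$target
        hops=$((hops + 1))
    done
    if [ -e "$path" ]; then
        printf '%s exists\n' "$path"
    else
        printf '%s MISSING\n' "$path"
        false
    fi
)
```

Usage would look like: link_chain libskgxn2.so (run from the directory that holds the link).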

Check the process status
[root@w-pc-i620-212 ~]#systemctl status vcsmm
● vcsmm.service - Veritas Membership Manager (VCSMM)
Loaded: loaded (/opt/VRTSvcs/rac/bin/vcsmm; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2022-02-22 11:45:38 CST; 5 days ago <<<<<< vcsmm service state is failed
Main PID: 79772 (code=exited, status=2)
Feb 22 11:45:37 w-pc-i620-212 systemd[1]: Starting Veritas Membership Manager (VCSMM)...
Feb 22 11:45:38 w-pc-i620-212 vcsmm[79772]: Starting VCSMM:
Feb 22 11:45:38 w-pc-i620-212 vcsmm[79772]: This script is not allowed to start VCSMM. VCSMM_START
is not 1
Feb 22 11:45:38 w-pc-i620-212 systemd[1]: vcsmm.service: main process exited, code=exited,
status=2/INVALIDARGUMENT
Feb 22 11:45:38 w-pc-i620-212 systemd[1]: Failed to start Veritas Membership Manager (VCSMM).
Feb 22 11:45:38 w-pc-i620-212 systemd[1]: Unit vcsmm.service entered failed state.
Feb 22 11:45:38 w-pc-i620-212 systemd[1]: vcsmm.service failed.
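
The message "VCSMM_START is not 1" indicates that the start script reads a VCSMM_START setting and refuses to run unless it equals 1. The sketch below checks that setting in a configuration file supplied by the caller. The function name vcsmm_start_enabled is hypothetical, and the file location is an assumption: on Linux it is commonly /etc/sysconfig/vcsmm, but this should be confirmed against the Veritas documentation for the installed version.

```shell
# Print the last VCSMM_START assignment found in the given file and
# succeed only if its value is exactly 1.
# Assumption: the file uses plain, unquoted KEY=VALUE shell assignments.
vcsmm_start_enabled() (
    value=$(sed -n 's/^[[:space:]]*VCSMM_START=//p' "$1" | tail -n 1)
    printf 'VCSMM_START=%s\n' "${value:-<unset>}"
    [ "$value" = "1" ]
)
```

Usage would look like: vcsmm_start_enabled /etc/sysconfig/vcsmm || echo "vcsmm will not start"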

5 Start the vcsmm service
See the documentation:
https://www.veritas.com/content/support/en_US/doc/79664151-149462782-0/v13609401-149462782

6 Recompile the grid components
While restarting crs, asm failed to start and its alert log reported errors. Following (Doc ID 1997729.1), the grid components were recompiled.
[root@w-pc-i620-212 /grid/12.2/crs/install]#./rootcrs.sh -unlock
[grid@w-pc-i620-240 /grid/12.2/rdbms/lib]$make -f ins_rdbms.mk rac_on ipc_g ioracle

[root@w-pc-i620-212 /grid/12.2/crs/install]#./rootcrs.sh -lock

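7 The database does not start automatically
While restarting crs, the db resource was not brought up automatically. After relinking the db home, the database resource started normally.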
. For details refer to "(:CLSN00107:)" in
"/grid/base/diag/crs/w-pc-i620-212/crs/trace/crsd_oraagent_oracle.trc".
11:54:55.246 [ORAAGENT(104719)]CRS-8500: Oracle Clusterware ORAAGENT process is
starting with operating system process ID 104719
11:54:57.321 [ORAAGENT(104719)]CRS-5017: The resource action "ora.tjdm36.db start"
encountered the following error:
11:54:57.321+ORA-12547: TNS:lost contact
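
When an alert log is long, listing the distinct error codes gives a quick overview before reading the details. A minimal sketch, assuming codes appear in the ORA-nnnnn and CRS-nnnn forms seen above. The function name error_codes is hypothetical.

```shell
# Print the distinct ORA- and CRS- codes found in log text on stdin,
# one per line, sorted.
error_codes() {
    grep -oE '(ORA|CRS)-[0-9]+' | sort -u
}
```

Usage would look like: error_codes < alert.log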
Run the command:
[oracle@w-pc-i620-212 ~]$relink all
writing relink log to: /oracle/12.2/install/relink.log

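Summary: after the kernel upgrade, have a Veritas engineer check the vcsmm status of the Veritas cluster, then recompile the components under the grid home according to (Doc ID 1997729.1), and relink the db components under the db software home.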
