Analysis of a RAC Database Instance Restart Caused by an HAIP Address Conflict
HAIP (Highly Available IP) addresses are used by Oracle RAC for communication between the database instances and the ASM instances. They are allocated by the cluster software from the 169.254.0.0/16 link-local range, and no other device on the cluster private network may use addresses from that range; otherwise cluster communication becomes unreliable. This case is a typical example: the conflicting addresses came from the management mechanism (IMM) of the IBM midrange server, and the system had also not been installed to standard; otherwise this hidden fault would never have surfaced.
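For reference, the HAIP addresses an instance actually uses can be checked at any time from the GV$CLUSTER_INTERCONNECTS view; a quick sketch (standard dynamic view, connection shown only as an example):
sqlplus / as sysdba
SQL> select inst_id, name, ip_address, is_public, source from gv$cluster_interconnects;
On a node with HAIP enabled, each row returned for a private interface should be a 169.254.x.x address with IS_PUBLIC = NO.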
1. Fault symptoms
Instance testdb1 restarted; in the morning, users reported that the application had been alternately fast and slow since the restart.
The database alert log reported errors such as: LMS1 (ospid: 10636) has not called a wait for 70 secs; opiodr aborting process unknown ospid (10142) as a result of ORA-609; ORA-29770: global enqueue process LMS2 (OSID 10640) is hung for more than 70 seconds; ORA-29743: exiting from instance membership recovery protocol because this instance is not in good health; JYGL(5):USER (ospid: 19749): terminating the instance due to error 481.
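As a first response to such a report, the instance status and the most recent alert-log entries are usually checked; a minimal sketch, assuming the database unique name testdb and the standard alert-log location under the diag trace directory that appears in the errors below:
srvctl status database -db testdb
tail -200 /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/alert_testdb1.log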
2. Fault analysis and recommendations
Analysis process
1 Checking the database alert log: starting at 22:16 on December 1, cluster background processes, including LMS and LMD, began reporting problems one after another.
2021-12-01T22:16:21.900038+08:00
LMS1 (ospid: 10636) has not called a wait for 70 secs.
2021-12-01T22:16:21.900332+08:00
LMS1 (ospid: 10636) is hung in an acceptable location ( 0x2.ffff).
2021-12-01T22:17:13.808835+08:00
LMS3 (ospid: 10642) has not called a wait for 49 secs.
2021-12-01T22:17:13.809045+08:00
LMS3 (ospid: 10642) is hung in an acceptable location ( 0x2.ffff).
......
2021-12-01T23:04:59.658678+08:00
LMS0 (ospid: 10635) has not called a wait for 19 secs.
2021-12-01T23:04:59.659455+08:00
LMS0 (ospid: 10635) is hung in an acceptable location ( 0x2.ffff).
2021-12-01T23:06:06.117488+08:00
Oracle processes begin failing to respond to connection requests:
2021-12-01T23:22:08.682809+08:00
opiodr aborting process unknown ospid (10142) as a result of ORA-609
2021-12-01T23:22:15.509835+08:00
opiodr aborting process unknown ospid (10237) as a result of ORA-609
......
2021-12-02T00:22:16.822734+08:00
Archived Log entry 349601 added for T-1.S-180227 ID 0x4b8b13d3 LAD:1
2021-12-02T00:22:16.828232+08:00
opiodr aborting process unknown ospid (28872) as a result of ORA-609
2021-12-02T00:22:25.618633+08:00
opiodr aborting process unknown ospid (28950) as a result of ORA-609
2021-12-02T00:22:34.166012+08:00
opiodr aborting process unknown ospid (28976) as a result of ORA-609
(subsequent similar output omitted)
LMS process anomalies appear again; judging by the wait times, the private network is unstable and LMS communication is stalling intermittently:
2021-12-02T00:47:22.631931+08:00
LMS3 (ospid: 10642) has not called a wait for 18 secs.
2021-12-02T00:47:22.632187+08:00
LMS3 (ospid: 10642) is hung in an acceptable location ( 0x2.ffff).
2021-12-02T00:49:18.719856+08:00
LMS1 (ospid: 10636) has not called a wait for 45 secs.
2021-12-02T00:49:18.720097+08:00
LMS1 (ospid: 10636) is hung in an acceptable location ( 0x2.ffff).
2021-12-02T00:50:40.838369+08:00
LMS anomalies keep appearing afterwards, but all stay within the threshold ("in an acceptable location"):
2021-12-02T01:05:15.932807+08:00
LMS1 (ospid: 10636) has not called a wait for 4 secs.
2021-12-02T01:05:15.933105+08:00
LMS1 (ospid: 10636) is hung in an acceptable location ( 0x2.ffff).
LMS2 (ospid: 10640) has not called a wait for 273 secs.
2021-12-02T01:06:41.836829+08:00
YSSC(3):minact-scn: got error during useg scan e:12751 usn:3
YSSC(3):minact-scn: useg scan erroring out with error e:12751
The lmhb process detects that LMS2 is hung and writes a trace file recording the timestamps; this is the key evidence for the analysis:
2021-12-02T01:07:14.769396+08:00
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_lmhb_10658.trc (incident=852329) (PDBNAME=CDB$ROOT):
ORA-29770: global enqueue process LMS2 (OSID 10640) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/testdb/testdb1/incident/incdir_852329/testdb1_lmhb_10658_i852329.trc
Oracle decides the network is unhealthy and announces that it will terminate the instance; the cause of this unhealthiness is analysed further below:
2021-12-02T01:12:06.495803+08:00
This instance is not in good health.
Either some of its receivers are not making progress, or
the system network is not in good health.
Hence this instance is gracefully terminating itself and <<<<<< the instance shuts itself down gracefully
will not participate in the current IMR reconfiguration.
Another trace file is written, this time from LMON:
2021-12-02T01:12:06.496844+08:00
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_lmon_10631.trc:
ORA-29743: exiting from instance membership recovery protocol because this instance is not in good health
2021-12-02T01:12:06.517893+08:00
The LMON process then starts terminating the instance (error 481):
JYGL(5):USER (ospid: 19749): terminating the instance due to error 481
2021-12-02T01:12:06.557337+08:00
JYGL(5):opiodr aborting process unknown ospid (14484) as a result of ORA-1092
A series of connection rejections follows:
2021-12-02T01:12:06.812384+08:00
JYGL(5):opiodr aborting process unknown ospid (842) as a result of ORA-1092
Because the instance is being terminated abnormally, Oracle writes out a system state dump (SSD):
2021-12-02T01:12:06.814099+08:00
System State dumped to trace file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_diag_10616_20211202011206.trc
2021-12-02T01:12:06.821951+08:00
Instance termination completes at 01:13:
2021-12-02T01:13:52.280720+08:00
Instance terminated by USER, pid = 11651
The restart begins at 01:14:26:
2021-12-02T01:14:26.647968+08:00
Starting ORACLE instance (normal) (OS id: 11953)
From this point until the start of business in the morning the database stayed up, but the LMS processes remained abnormal. Below is one such occurrence:
2021-12-02T07:50:01.484237+08:00
LMS3 (ospid: 12165) has not called a wait for 25 secs.
2021-12-02T07:50:01.484492+08:00
LMS3 (ospid: 12165) is hung in an acceptable location ( 0x2.ffff).
2021-12-02T07:51:24.751903+08:00
LMS3 (ospid: 12165) has not called a wait for 116 secs.
2021-12-02T07:53:54.641565+08:00
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_lmhb_12181.trc (incident=876331) (PDBNAME=CDB$ROOT):
ORA-29770: global enqueue process LMS3 (OSID 12165) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/testdb/testdb1/incident/incdir_876331/testdb1_lmhb_12181_i876331.trc
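To build a timeline of these events, the key messages can be filtered out of the alert log; a sketch, assuming the standard alert_testdb1.log name in the trace directory shown above:
cd /u01/app/oracle/diag/rdbms/testdb/testdb1/trace
egrep -n 'has not called a wait|ORA-29770|ORA-29743|ORA-609|terminating the instance' alert_testdb1.log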
Analyzing the lmhb trace file
Before the instance restart, the LMS2 process had made no progress for more than 273 seconds, and Oracle recorded the details of the hung process:
*** 2021-12-02T01:05:15.937646+08:00 (CDB$ROOT(1))
==============================
LMS2 (ospid: 10640) has not moved for 274 sec (1638378314.1638378040)
: heartbeat check status 6 (no-heartbeat) (threshold 70 sec)
*** 2021-12-02T01:05:20.939906+08:00 (CDB$ROOT(1))
==================================================
*** 2021-12-02T01:05:31.123684+08:00 (CDB$ROOT(1))
=== LMS2 (ospid: 10640) Heartbeat Report
==================================================
LMS2 (ospid: 10640) has no heartbeats for 279 sec. (threshold 70)
: heartbeat state 0x2.ffff () pso-flag 0x100
: Not in wait; last wait ended 273 secs ago.
: last wait_id 2622895779 at 'gcs remote message'
*** 2021-12-02T01:11:51.467993+08:00 (CDB$ROOT(1))
LMS1 (ospid: 10636) has no heartbeats for 619 sec. (threshold 70)
: heartbeat state 0x2.ffff () pso-flag 0x100
: Not in wait; last wait ended 6 secs ago.
: last wait_id 2606404062 at 'kjctssqmg: quick message send wait'.
This shows that the LMS processes lost their heartbeats and timed out, after which Oracle's LMON process restarted the instance. We can therefore be certain that the instance restart was caused by the LMS processes losing their network heartbeats.
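To confirm which addresses the instance really uses for interconnect traffic, an IPC dump can be taken with standard oradebug commands (the dump goes to the session's trace file):
sqlplus / as sysdba
SQL> oradebug setmypid
SQL> oradebug ipc
SQL> oradebug tracefile_name
The resulting trace file should list the interconnect addresses in use, which on this node should be the HAIP (169.254.x.x) addresses.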
The next question is why the network heartbeats went bad. The earlier analysis showed Oracle reporting the network as unhealthy, and LMS traffic runs over the HAIP addresses on the private network, so we suspected either the private network itself or its addresses. We then restarted the instance manually, and Oracle's own health check exposed the problem; the database alert log at that point was as follows:
2021-12-02T15:07:27.304431+08:00
Using default pga_aggregate_limit of 122880 MB
2021-12-02T15:07:28.800725+08:00
This instance is not in good health and terminating itself.
2021-12-02T15:07:28.801266+08:00
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_lmon_16214.trc:
ORA-29743: exiting from instance membership recovery protocol because this instance is not in good health
2021-12-02T15:07:28.918144+08:00
Error: Shutdown in progress. Error: 29743.
2021-12-02T15:07:29.015580+08:00
LMON (ospid: 16214): terminating the instance due to error 481
2021-12-02T15:07:29.049918+08:00
System state dump requested by (instance=1, osid=4294983510 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_diag_16198_20211202150729.trc
2021-12-02T15:07:30.045967+08:00
Dumping diagnostic data in directory=[cdmp_20211202150729], requested by (instance=1, osid=4294983510 (LMON)), summary=[abno
rmal instance termination].
2021-12-02T15:07:31.423012+08:00
Instance terminated by LMON, pid = 16214
Analyzing the lmon trace file revealed the problem.
************ lmon trace warning
=========================
Local instance 1 uses 4 interfaces.
[0]: 169.254.41.56
[1]: 169.254.111.220
[2]: 169.254.182.82
[3]: 169.254.199.102
================================
== System Network Information ==
================================
==[ Network Interfaces : 13 (13 max) ]============
aggr0 | 192.168.130.51 | 255.255.255.0 | UP|RUNNING
aggr0:1 | 169.254.182.82 | 255.255.192.0 | UP|RUNNING
aggr0:3 | 169.254.111.220 | 255.255.192.0 | DOWN|RUNNING
aggr1 | 10.3.122.179 | 255.255.255.128 | UP|RUNNING
aggr1:1 | 10.3.122.181 | 255.255.255.128 | UP|RUNNING
net4 | 172.16.20.11 | 255.255.255.0 | UP|RUNNING
net4:1 | 169.254.41.56 | 255.255.192.0 | UP|RUNNING
net4:2 | 169.254.182.82 | 255.255.192.0 | DOWN|RUNNING
net4:3 | 169.254.199.102 | 255.255.192.0 | UP|RUNNING
net5 | 172.16.11.11 | 255.255.255.0 | UP|RUNNING
net5:1 | 169.254.111.220 | 255.255.192.0 | UP|RUNNING
net5:2 | 169.254.199.102 | 255.255.192.0 | DOWN|RUNNING
Below is the cluster private-network configuration:
grid@testdba:~$ oifcfg getif
aggr0 192.168.130.0 global cluster_interconnect,asm
aggr1 10.3.121.119 global public
net4 172.16.20.0 global cluster_interconnect
net5 172.16.12.0 global cluster_interconnect
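The HAIP addresses themselves are managed by the ora.cluster_interconnect.haip init resource, whose state can be checked with crsctl; a sketch, run from the Grid Infrastructure environment:
grid@testdba:~$ crsctl stat res ora.cluster_interconnect.haip -init
With three cluster_interconnect interfaces configured as above, Oracle allocates four HAIP addresses per node; that is the baseline the ifconfig output below is compared against.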
Below is the actual address assignment on the network interfaces at the time the instance could not be restarted.
********************** as the grid user
grid@testdba:~$ ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
aggr0: flags=100001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 3
inet 192.168.130.51 netmask ffffff00 broadcast 192.168.130.255
aggr0:1: flags=100001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 3
inet 169.254.182.82 netmask ffffc000 broadcast 169.254.191.255
aggr0:3: flags=100001000842<BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 3
inet 169.254.111.220 netmask ffffc000 broadcast 169.254.127.255
inet 10.3.122.181 netmask ffffff80 broadcast 10.3.122.255
net4: flags=100001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 8
inet 172.16.20.11 netmask ffffff00 broadcast 172.16.20.255
net4:1: flags=100001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 8
inet 169.254.41.56 netmask ffffc000 broadcast 169.254.63.255
net4:2: flags=100001000842<BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 8
inet 169.254.182.82 netmask ffffc000 broadcast 169.254.191.255
net4:3: flags=100001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 8
inet 169.254.199.102 netmask ffffc000 broadcast 169.254.255.255
net5: flags=100001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 7
inet 172.16.11.11 netmask ffffff00 broadcast 172.16.11.255
net5:1: flags=100001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 7
inet 169.254.111.220 netmask ffffc000 broadcast 169.254.127.255
net5:2: flags=100001000842<BROADCAST,RUNNING,MULTICAST,IPv4,PHYSRUNNING> mtu 1500 index 7
inet 169.254.199.102 netmask ffffc000 broadcast 169.254.255.255
aggr0, net4 and net5 are the cluster's private interconnects, and aggr0 carries both the ASM heartbeat and the cluster heartbeat (used by LMS and the other cluster processes). However, three extra virtual interfaces are present: aggr0:3, net4:2 and net5:2, i.e. three extra 169.254 addresses. By Oracle's rules, three private interconnects should yield exactly four HAIP addresses per node, yet seven were present, three too many, which is clearly abnormal.
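A quick way to count the 169.254 plumb points on the node (Solaris ifconfig output as above):
ifconfig -a | grep 'inet 169.254'
ifconfig -a | grep -c 'inet 169.254'      # expected 4 for this configuration; 7 were present here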
Following this hint from Oracle, we manually brought down the three extra virtual interfaces:
ifconfig aggr0:3 unplumb
ifconfig net4:2 unplumb
ifconfig net5:2 unplumb
After that, the instance restarted successfully.
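For reference, restarting just this instance can be done with srvctl; database and instance names below are assumed from the log paths:
srvctl start instance -db testdb -instance testdb1
srvctl status database -db testdb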
Further research showed that the IBM midrange server has a management address attached to aggr0 which conflicts with Oracle HAIP; this disturbed the private network, caused the LMS timeouts and led to the instance restart. The server engineer confirmed this point. The server engineer's conclusion:
aggr0 carries two network segments, 192.168.130.51 and 169.254.182.82. The 169.254.182 segment is used for internal communication between the ILOM management interface and the OS and should not be used for cluster heartbeats.
When the cluster heartbeat runs into trouble, the cluster tries to check heartbeat status over the 169.254.182 segment; because that segment also carries other traffic, the node is marked as unhealthy.
Recommendations:
1. Have the host engineer resolve the address conflict: on the host side, block that address segment from reaching the aggr0 port, or block that address range at the switch.
2. Remove aggr0 from the cluster private-network configuration. This needs testing first, because aggr0 also carries the ASM heartbeat; the procedure for switching to another interface and then removing aggr0 must be validated to keep the production database safe (see the sketch after this list).
3. The system sluggishness is to be resolved by the server engineer:
Some files have since been deleted to free space and the connection sluggishness is resolved; the server engineer's explanation follows:
The disk holding the OS was extremely busy. The system disk is 1 TB, 85% full, and holds multiple layers of backup snapshots; the large number of files is suspected of slowing down the system disk's response.
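For recommendation 2, the change itself would be made with oifcfg. The following is only a sketch of the kind of commands involved (interface and subnet values taken from the oifcfg getif output above); as noted, it must be validated in a test environment first, because it moves the ASM network and normally only takes effect after a clusterware restart on all nodes:
# as the grid user
oifcfg setif -global net4/172.16.20.0:cluster_interconnect,asm      # let another private interface carry the asm network
oifcfg delif -global aggr0/192.168.130.0                            # then remove aggr0 from the cluster configuration
oifcfg getif                                                        # verify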