RAC instance fails to start and restarts repeatedly after a motherboard replacement (private-network LMS communication failure)

2021-05-27 00:00:00 cluster production node instance eviction

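Because of a CPU problem, the motherboard was replaced. After the system came back up, the cluster was found to be unhealthy, so it was shut down and restarted. The database instance then failed to start normally and kept restarting. The analysis follows.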

Node 2 database alert log
Starting background process LCK0
2021-05-27T12:29:44.794777+08:00
LCK0 started with pid=78, OS id=112747
2021-05-27T12:29:46.375795+08:00
KSXPPING: KSXP selected for Ping
2021-05-27T12:34:16.096466+08:00
LMS0 (ospid: 112183) has detected no messaging activity from instance 1 >>>>>>>>>no messages received from node 1
2021-05-27T12:34:16.103933+08:00
LMS1 (ospid: 112185) has detected no messaging activity from instance 1
2021-05-27T12:34:16.106833+08:00
USER (ospid: 112183) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2021-05-27T12:34:16.107308+08:00
Communications reconfiguration: instance_number 1 by ospid 112183
2021-05-27T12:34:16.114242+08:00
USER (ospid: 112185) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2021-05-27T12:34:16.116948+08:00
LMON (ospid: 112179) drops the IMR request from LMS1 (ospid: 112185) because IMR is in progress and inst 1 is marked bad.
2021-05-27T12:34:19.386352+08:00
Errors in file /oracle/diag/rdbms/xxyy20dg/xxyy20dg2/trace/xxyy20dg2_lmon_112179.trc (incident=288177):
ORA-29740: evicted by instance number 1, group incarnation 24 <<<<<<<<<evicted by node 1 after the timeout
Incident details in: /oracle/diag/rdbms/xxyy20dg/xxyy20dg2/incident/incdir_288177/xxyy20dg2_lmon_112179_i288177.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-05-27T12:34:19.731297+08:00
Errors in file /oracle/diag/rdbms/xxyy20dg/xxyy20dg2/trace/xxyy20dg2_lmon_112179.trc:
ORA-29740: evicted by instance number 1, group incarnation 24
Errors in file /oracle/diag/rdbms/xxyy20dg/xxyy20dg2/trace/xxyy20dg2_lmon_112179.trc (incident=288178):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/diag/rdbms/xxyy20dg/xxyy20dg2/incident/incdir_288178/xxyy20dg2_lmon_112179_i288178.trc
2021-05-27T12:34:19.861982+08:00
LCK0 (ospid: 112747): terminating the instance due to error 481    <<<<<<<LCK0 shuts down the instance because of error 481
2021-05-27T12:34:20.000555+08:00
System state dump requested by (instance=2, osid=112747 (LCK0)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/diag/rdbms/xxyy20dg/xxyy20dg2/trace/xxyy20dg2_diag_112158_20210527123420.trc
2021-05-27T12:34:20.460971+08:00
Dumping diagnostic data in directory=[cdmp_20210527123420], requested by (instance=2, osid=112747 (LCK0)), summary=[abnormal instance termination].
2021-05-27T12:34:21.023305+08:00
License high water mark = 2
2021-05-27T12:34:25.873669+08:00
Instance terminated by LCK0, pid = 112747
2021-05-27T12:34:25.875255+08:00
Warning: 2 processes are still attach to shmid 688134:
(size: 32768 bytes, creator pid: 111580, last attach/detach pid: 112154)
2021-05-27T12:34:26.025110+08:00
USER (ospid: 123846): terminating the instance
2021-05-27T12:34:26.026599+08:00
Instance terminated by USER, pid = 123846
2021-05-27T12:34:27.570396+08:00
Adjusting the default value of parameter parallel_max_servers
from 3840 to 2520 due to the value of parameter processes (3000)
Starting ORACLE instance (normal) (OS id: 124374)
2021-05-27T12:34:27.581928+08:00
CLI notifier numLatches:131 maxDescs:5068

Node 1
Reconfiguration complete (total time 0.7 secs)
2021-05-27T12:33:42.191256+08:00
RFS[10]: Selected log 13 for T-1.S-19093 dbid 3749585617 branch 997869779
2021-05-27T12:33:43.986961+08:00
Archived Log entry 10665 added for T-1.S-19092 ID 0xe3ee674b LAD:1
2021-05-27T12:34:19.100266+08:00
LMS1 (ospid: 225532) has detected no messaging activity from instance 2  >>>>>>>>>no messages received from node 2
USER (ospid: 225532) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2021-05-27T12:34:19.104213+08:00
Suppressed nested communications reconfiguration: instance_number 2
2021-05-27T12:34:19.119729+08:00
LMS0 (ospid: 225528) has detected no messaging activity from instance 2  >>>>>>>>>no messages received from node 2
USER (ospid: 225528) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2021-05-27T12:34:19.123806+08:00
LMON (ospid: 225521) drops the IMR request from LMS0 (ospid: 225528) because IMR is in progress and inst 2 is marked bad.
Evicting instance 2 from cluster                     >>>>>>>>>node 2 is evicted
Evicting instance 2 from cluster                     
Waiting for instances to leave: 2
2021-05-27T12:34:25.227015+08:00
Reconfiguration started (old inc 22, new inc 26)
List of instances (total 2) :
1 3
Dead instances (total 1) :
2
My inst 1
Global Resource Directory frozen
Communication channels reestablished
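The key lines in the two alert logs above can be pulled out with plain grep. The following is a minimal sketch, assuming the alert log has been saved to a local file; the function names eviction_events and count_instance_starts are hypothetical helpers, and the search strings are taken verbatim from the log excerpts above.

```shell
# Print (with line numbers) the alert-log lines that mark the eviction sequence.
# $1 is the path to an alert log file.
eviction_events() {
    grep -nE 'ORA-29740|no messaging activity|Evicting instance|terminating the instance' "$1"
}

# Count how many times the instance was started in the given alert log.
# A count that keeps growing is consistent with the restart loop described above.
count_instance_starts() {
    grep -c 'Starting ORACLE instance' "$1"
}
```

Running both helpers against each node's alert log gives a quick timeline of who lost contact with whom and how often the instance was restarted.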


Node 2 LMS1 process trace
*** 2021-05-27T12:32:54.017373+08:00
IPCLW:[0.73]{E}[WAIT]:PROTO: [1622089974017311]ACNH 0x7ffff099de70: 64 attempts to connect:
IPCLW:[0.74]{-}[WAIT]:UTIL: [1622089974017311] ACNH 0x7ffff099de70 State: 0 SMSN: 1315317114 MSN: 1315317116 Seq: 361774455 # Pending: 2
IPCLW:[0.75]{-}[WAIT]:UTIL: [1622089974017311] Peer: [UNKNWN].0 AckSeq: 0
IPCLW:[0.76]{-}[WAIT]:UTIL: [1622089974017311] Flags: 0x00000000 IHint: 0x7e992880000001f THint: 0x0
IPCLW:[0.77]{-}[WAIT]:UTIL: [1622089974017311] Local Address: 169.254.175.169:13054 Remote Address: 169.254.94.79:49773 >>>>address being pinged
IPCLW:[0.78]{-}[WAIT]:UTIL: [1622089974017311] Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 8965 xdata2 f8e387e0
IPCLW:[0.79]{-}[WAIT]:UTIL: [1622089974017311] : mmsz 8472 mmr 9200 mms 2 xdata c34778d2
IPCLW:[0.80]{-}[WAIT]:UTIL: [1622089974017311] IVPort: 49248 TVPort: 30930 IMPT: 65074 RMPT: 35173 Pending Sends: Yes Unacked Sends: Yes
  >>>>>>>>>>>>>>>>>>LMS has pending sends, but no ACK has been received for them.
IPCLW:[0.81]{-}[WAIT]:UTIL: [1622089974017311] Send Engine Queued: No sshdl -1 ssts 0 rtts 0 snderrchk 0 creqcnt 64 credits 0/0
IPCLW:[0.82]{-}[WAIT]:UTIL: [1622089974017311] Unackd Messages 1315317114 -> 1315317115. SSEQ 361774453 Send Time: INVALID TIME SMSN # Xmits: 0 EMSN INVALID TIME
IPCLW:[0.83]{-}[WAIT]:UTIL: [1622089974017311] Pending send queue:
IPCLW:[0.84]{-}[WAIT]:UTIL: [1622089974017311] [0] Mbuf 0x7ffff1b4f7d0 MSN 1315317114 Seq 361774453 -> 361774454 # XMits: 0
IPCLW:[0.85]{-}[WAIT]:UTIL: [1622089974017311] [1] Mbuf 0x7ffff1b4eb70 MSN 1315317115 Seq 361774454 -> 361774455 # XMits: 0
IPCLW:[0.86]{E}[WAIT]:PROTO: [1622089974017311]ACNH 0x7ffff099d2b0: 64 attempts to connect:
IPCLW:[0.87]{-}[WAIT]:UTIL: [1622089974017311] ACNH 0x7ffff099d2b0 State: 0 SMSN: 636777413 MSN: 636777414 Seq: 709697713 # Pending: 1
IPCLW:[0.88]{-}[WAIT]:UTIL: [1622089974017311] Peer: [UNKNWN].0 AckSeq: 0
IPCLW:[0.89]{-}[WAIT]:UTIL: [1622089974017311] Flags: 0x00000000 IHint: 0x7e992880000001d THint: 0x0
IPCLW:[0.90]{-}[WAIT]:UTIL: [1622089974017311] Local Address: 169.254.175.169:13054 Remote Address: 169.254.34.64:61100
IPCLW:[0.91]{-}[WAIT]:UTIL: [1622089974017311] Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 53ae xdata2 2522800
IPCLW:[0.92]{-}[WAIT]:UTIL: [1622089974017311] : mmsz 8472 mmr 9200 mms 2 xdata c348d8ac
IPCLW:[0.93]{-}[WAIT]:UTIL: [1622089974017311] IVPort: 49248 TVPort: 55468 IMPT: 65074 RMPT: 21422 Pending Sends: Yes Unacked Sends: Yes
IPCLW:[0.94]{-}[WAIT]:UTIL: [1622089974017311] Send Engine Queued: No sshdl -1 ssts 0 rtts 0 snderrchk 0 creqcnt 64 credits 0/0
IPCLW:[0.95]{-}[WAIT]:UTIL: [1622089974017311] Unackd Messages 636777413 -> 636777413. SSEQ 709697712 Send Time: INVALID TIME SMSN # Xmits: 0 EMSN INVALID TIME
IPCLW:[0.96]{-}[WAIT]:UTIL: [1622089974017311] Pending send queue:
IPCLW:[0.97]{-}[WAIT]:UTIL: [1622089974017311] [0] Mbuf 0x7ffff1b4f3b0 MSN 636777413 Seq 709697712 -> 709697713 # XMits: 0

*** 2021-05-27T12:33:42.275875+08:00
IPCLW:[0.98]{E}[WAIT]:PROTO: [1622090022275813]ACNH 0x7ffff099de70: 80 attempts to connect:
IPCLW:[0.99]{-}[WAIT]:UTIL: [1622090022275813] ACNH 0x7ffff099de70 State: 0 SMSN: 1315317114 MSN: 1315317116 Seq: 361774455 # Pending: 2
IPCLW:[0.100]{-}[WAIT]:UTIL: [1622090022275813] Peer: [UNKNWN].0 AckSeq: 0
IPCLW:[0.101]{-}[WAIT]:UTIL: [1622090022275813] Flags: 0x00000000 IHint: 0x7e992880000001f THint: 0x0
IPCLW:[0.102]{-}[WAIT]:UTIL: [1622090022275813] Local Address: 169.254.175.169:13054 Remote Address: 169.254.94.79:49773
IPCLW:[0.103]{-}[WAIT]:UTIL: [1622090022275813] Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 8965 xdata2 f8e387e0
IPCLW:[0.104]{-}[WAIT]:UTIL: [1622090022275813] : mmsz 8472 mmr 9200 mms 2 xdata c34778d2
IPCLW:[0.105]{-}[WAIT]:UTIL: [1622090022275813] IVPort: 49248 TVPort: 30930 IMPT: 65074 RMPT: 35173 Pending Sends: Yes Unacked Sends: Yes
IPCLW:[0.106]{-}[WAIT]:UTIL: [1622090022275813] Send Engine Queued: No sshdl -1 ssts 0 rtts 0 snderrchk 0 creqcnt 80 credits 0/0
IPCLW:[0.107]{-}[WAIT]:UTIL: [1622090022275813] Unackd Messages 1315317114 -> 1315317115. SSEQ 361774453 Send Time: INVALID TIME SMSN # Xmits: 0 EMSN INVALID TIME
IPCLW:[0.108]{-}[WAIT]:UTIL: [1622090022275813] Pending send queue:
IPCLW:[0.109]{-}[WAIT]:UTIL: [1622090022275813] [0] Mbuf 0x7ffff1b4f7d0 MSN 1315317114 Seq 361774453 -> 361774454 # XMits: 0
IPCLW:[0.110]{-}[WAIT]:UTIL: [1622090022275813] [1] Mbuf 0x7ffff1b4eb70 MSN 1315317115 Seq 361774454 -> 361774455 # XMits: 0
IPCLW:[0.111]{E}[WAIT]:PROTO: [1622090022275813]ACNH 0x7ffff099d2b0: 80 attempts to connect:
IPCLW:[0.112]{-}[WAIT]:UTIL: [1622090022275813] ACNH 0x7ffff099d2b0 State: 0 SMSN: 636777413 MSN: 636777414 Seq: 709697713 # Pending: 1
IPCLW:[0.113]{-}[WAIT]:UTIL: [1622090022275813] Peer: [UNKNWN].0 AckSeq: 0
IPCLW:[0.114]{-}[WAIT]:UTIL: [1622090022275813] Flags: 0x00000000 IHint: 0x7e992880000001d THint: 0x0
IPCLW:[0.115]{-}[WAIT]:UTIL: [1622090022275813] Local Address: 169.254.175.169:13054 Remote Address: 169.254.34.64:61100
IPCLW:[0.116]{-}[WAIT]:UTIL: [1622090022275813] Remote PID: ver 0 flags 1 trans 2 tos 0 opts 0 xdata3 53ae xdata2 2522800
IPCLW:[0.117]{-}[WAIT]:UTIL: [1622090022275813] : mmsz 8472 mmr 9200 mms 2 xdata c348d8ac
IPCLW:[0.118]{-}[WAIT]:UTIL: [1622090022275813] IVPort: 49248 TVPort: 55468 IMPT: 65074 RMPT: 21422 Pending Sends: Yes Unacked Sends: Yes
IPCLW:[0.119]{-}[WAIT]:UTIL: [1622090022275813] Send Engine Queued: No sshdl -1 ssts 0 rtts 0 snderrchk 0 creqcnt 80 credits 0/0
IPCLW:[0.120]{-}[WAIT]:UTIL: [1622090022275813] Unackd Messages 636777413 -> 636777413. SSEQ 709697712 Send Time: INVALID TIME SMSN # Xmits: 0 EMSN INVALID TIME
IPCLW:[0.121]{-}[WAIT]:UTIL: [1622090022275813] Pending send queue:
IPCLW:[0.122]{-}[WAIT]:UTIL: [1622090022275813] [0] Mbuf 0x7ffff1b4f3b0 MSN 636777413 Seq 709697712 -> 709697713 # XMits: 0
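The trace above shows the connect-attempt counter climbing from 64 to 80 against the same two remote addresses while sends stay pending and unacknowledged. A minimal sketch for extracting those two facts from an LMS trace file follows; the function names are hypothetical, and the patterns assume the IPCLW line format shown in this excerpt, which may differ in other versions.

```shell
# List the distinct remote address:port pairs the process is trying to reach.
# $1 is the path to an LMS trace file.
ipclw_remote_addresses() {
    grep -o 'Remote Address: [0-9.]*:[0-9]*' "$1" | awk '{print $3}' | sort -u
}

# Print the highest "N attempts to connect" counter seen in the trace.
ipclw_max_connect_attempts() {
    grep -oE '[0-9]+ attempts to connect' "$1" | awk '{print $1}' | sort -n | tail -n 1
}
```

If the counter keeps growing for the same peers between dumps, the private network is the next thing to check, which is the direction this analysis takes.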

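This points to a private-network communication problem at the instance level. Based on the message "LCK0 (ospid: 112747): terminating the instance due to error 481", we found a similar case on MOS: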

RAC DB creation fails with error - LMON (ospid: XXXX): terminating the instance due to ORA error 481 (Doc ID 2626738.1)

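After checking and testing, this did resolve the problem.
Cause: Oracle Linux 7 and RHEL 7 use Strict Reverse Path Filtering by default. With strict mode enabled on the private network, private-network communication is interrupted, which leads to the instance being evicted. Oracle recommends loose mode.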
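The rp_filter values involved can be summarized in a short sketch. The meanings (0 = off, 1 = strict, 2 = loose) and the rule that the kernel applies the larger of conf.all and the per-interface value come from the Linux kernel's ip-sysctl documentation; the function names rp_filter_mode and effective_rp_filter are hypothetical helpers written for illustration.

```shell
# Translate an rp_filter value into the mode it selects.
rp_filter_mode() {
    case "$1" in
        0) echo "off" ;;
        1) echo "strict" ;;
        2) echo "loose" ;;
        *) echo "unknown" ;;
    esac
}

# The kernel validates sources on an interface using the larger of
# net.ipv4.conf.all.rp_filter ($1) and net.ipv4.conf.<iface>.rp_filter ($2).
effective_rp_filter() {
    if [ "$1" -gt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}
```

For example, with conf.all at 1 and the interface at 0, the effective value is 1 (strict), so a per-interface 0 alone does not guarantee that filtering is off.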

We checked the parameter settings and found that the interface names were wrong and the value was 0 (reverse path filtering disabled):
net.ipv4.conf.enp175s0f0.rp_filter = 0
net.ipv4.conf.enp176s0f1.rp_filter = 0
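A check like this can be scripted. The sketch below assumes the entries live in /etc/sysctl.conf and that the interfaces currently present are listed under /sys/class/net; the function name find_stale_rp_filter_keys is hypothetical. It prints every interface named in an rp_filter entry that does not exist on the system.

```shell
# Print interfaces referenced by net.ipv4.conf.<iface>.rp_filter entries
# that are not present on the system.
# $1: sysctl file (default /etc/sysctl.conf)
# $2: directory listing the existing interfaces (default /sys/class/net)
find_stale_rp_filter_keys() {
    grep -E '^[[:space:]]*net\.ipv4\.conf\.[^.]+\.rp_filter[[:space:]]*=' "${1:-/etc/sysctl.conf}" |
    while IFS='=' read -r key rest; do
        # Strip whitespace around the key, then cut out the interface name.
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        iface="${key#net.ipv4.conf.}"
        iface="${iface%.rp_filter}"
        # "all" and "default" are pseudo-entries, not real interfaces.
        case "$iface" in all|default) continue ;; esac
        [ -e "${2:-/sys/class/net}/$iface" ] || echo "$iface"
    done
}
```

Any name it prints is an entry that the kernel cannot apply to a real interface, which is worth reviewing after a hardware change that renames the NICs.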

So the fix is to change the parameters as follows:

vi /etc/sysctl.conf

net.ipv4.conf.enp8s0f0.rp_filter = 2
net.ipv4.conf.enp89s0f1.rp_filter = 2

sysctl -p     <<<<apply the parameters to the running kernel
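Before starting the database, the values actually in effect can be confirmed. This is a minimal sketch that reads them from /proc/sys/net/ipv4/conf, where the kernel exposes the same settings that sysctl manages; the function name list_rp_filter is hypothetical, and the optional directory argument exists only so the function can be pointed at a test directory.

```shell
# Print "<interface> <rp_filter value>" for every interface.
# $1: directory holding per-interface settings (default /proc/sys/net/ipv4/conf)
list_rp_filter() {
    for f in "${1:-/proc/sys/net/ipv4/conf}"/*/rp_filter; do
        # Skip the unexpanded glob or unreadable entries.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}
```

Called with no arguments, it should show 2 next to the private-network interfaces once the change has been applied.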

Then start the database:
sqlplus / as sysdba
startup       <<<<<succeeded








