Troubleshooting a cluster node that fails to start because its HAIP address is already in use

2020-09-15

One node of the cluster cannot start.

1 Check the clusterware startup status on the failed node

[grid@rac2 rac2]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                                                   
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        ONLINE  ONLINE       rac2                                         
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                                                   
ora.cssdmonitor
      1        ONLINE  ONLINE       rac2                                         
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       rac2                                         
ora.gpnpd
      1        ONLINE  ONLINE       rac2                                         
ora.mdnsd
      1        ONLINE  ONLINE       rac2

At this point only the lower-level resources needed to build the cluster have started; CSSD has not come up.
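For a quick overall check of the stack state on this node, crsctl can also be queried directly (a minimal sketch; the exact messages vary by Grid Infrastructure version):

[grid@rac2 ~]$ crsctl check crs
[grid@rac2 ~]$ crsctl check css

On this node, crsctl check css would be expected to report that Cluster Synchronization Services is not online.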

2 The first node is running normally, but CSSD on the second node fails to start. Checking the log with [grid@rac2 cssd]$ tail -100f ocssd.log | more shows the following errors:

2020-09-13 22:10:16.117: [    CSSD][3833571072]clssgmDiscEndpcl: gipcDestroy 0x96e8
2020-09-13 22:10:16.451: [    CSSD][3833571072]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0x7f06dc0593b0)
2020-09-13 22:10:16.451: [    CSSD][3833571072]clssgmShutDown: Received abortive shutdown request from client.
2020-09-13 22:10:16.451: [    CSSD][3833571072]###################################
2020-09-13 22:10:16.451: [    CSSD][3833571072]clssscExit: CSSD aborting from thread GMClientListener
2020-09-13 22:10:16.451: [    CSSD][3833571072]###################################
2020-09-13 22:10:16.451: [    CSSD][3833571072](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2020-09-13 22:10:16.552: [    CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-09-13 22:10:16.553: [    CSSD][3480196864]clssnmPollingThread: state(1) clusterState(0) exit
2020-09-13 22:10:16.553: [    CSSD][3480196864]clssscExit: abort already set 0
2020-09-13 22:10:16.561: [    CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt, 695082, LATS 3972724, lastSeqNo 695081, uniqueness 1599745548, timestamp 1600006216/11824574
2020-09-13 22:10:17.553: [    CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-09-13 22:10:17.563: [    CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt, 695083, LATS 3973724, lastSeqNo 695082, uniqueness 1599745548, timestamp 1600006217/11825574
2020-09-13 22:10:18.554: [    CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-09-13 22:10:18.566: [    CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt, 695084, LATS 3974734, lastSeqNo 695083, uniqueness 1599745548, timestamp 1600006218/11826574
2020-09-13 22:10:19.555: [    CSSD][3478619904]clssnmSendingThread: sending join msg to all nodes
2020-09-13 22:10:19.555: [    CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-09-13 22:10:19.555: [    CSSD][3478619904]clssnmSendingThread: sent 4 join msgs to all nodes
2020-09-13 22:10:19.570: [    CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt, 695085, LATS 3975734, lastSeqNo 695084, uniqueness 1599745548, timestamp 1600006219/11827574
2020-09-13 22:10:20.555: [    CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2020-09-13 22:10:20.575: [    CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt, 695086, LATS 3976744, lastSeqNo 695085, uniqueness 1599745548, timestamp 1600006220/11828574
2020-09-13 22:10:21.556: [    CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0

Note that these errors are from the 13th. When I restarted the clusterware on this node on the morning of the 15th, ocssd.log recorded nothing at all, meaning startup never even got this far; it was apparently stuck at some lower layer.
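When ocssd.log is not being written at all, the layers below CSSD are the place to look. On an 11.2-style installation these logs typically live under $GRID_HOME/log/<hostname>/ (the exact paths below are assumptions based on that usual layout):

$GRID_HOME/log/rac2/alertrac2.log    -- clusterware alert log
$GRID_HOME/log/rac2/ohasd/ohasd.log    -- OHASD, which drives the startup sequence
$GRID_HOME/log/rac2/agent/ohasd/oracssdagent_root/oracssdagent_root.log    -- agent responsible for starting CSSD
$GRID_HOME/log/rac2/gipcd/gipcd.log    -- interconnect endpoint daemon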

From the historical entries in ocssd.log, the key finding is that there is no network heartbeat.
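It is also worth confirming which interface and subnet the clusterware expects to use for the interconnect (and therefore for HAIP). A minimal sketch with oifcfg; the public-interface line is purely illustrative, while the cluster_interconnect line matches this environment's 192.168.57.0 private network:

[grid@rac2 ~]$ oifcfg getif
eth0  192.168.56.0  global  public
eth1  192.168.57.0  global  cluster_interconnect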

3 Test the private interconnect: the private addresses of the two nodes reach each other without any problem (a deeper cluvfy check is sketched after the ping output below)

[grid@rac2 rac2]$ ping rac1-priv
PING rac1-priv (192.168.57.101) 56(84) bytes of data.
64 bytes from rac1-priv (192.168.57.101): icmp_seq=1 ttl=64 time=0.268 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=2 ttl=64 time=0.451 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=3 ttl=64 time=0.407 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=4 ttl=64 time=0.629 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=5 ttl=64 time=0.452 ms
--- rac1-priv ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4363ms
rtt min/avg/max/mdev = 0.268/0.441/0.629/0.116 ms

[grid@rac2 rac2]$ ping rac2-priv
PING rac2-priv (192.168.57.102) 56(84) bytes of data.
64 bytes from rac2-priv (192.168.57.102): icmp_seq=1 ttl=64 time=0.035 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=2 ttl=64 time=0.050 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=3 ttl=64 time=0.047 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=4 ttl=64 time=0.050 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=5 ttl=64 time=0.048 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=6 ttl=64 time=0.096 ms
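Plain ping only proves basic IP reachability. For a deeper check of node connectivity over the private interface (subnet consistency, and multicast on versions that require it), cluvfy can be used; a sketch, assuming the Grid Infrastructure cluvfy is on the PATH:

[grid@rac2 ~]$ cluvfy comp nodecon -n rac1,rac2 -i eth1 -verbose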

4 Next, check the network configuration with ifconfig -a

Node 1 has the HAIP address on interface eth1, while eth1 on the second node has no HAIP address. This is rather odd; most likely an address from the HAIP range has already been brought up somewhere else on the second node.

Checking the second node, we find that such an address is indeed configured, on eth3.

Node 1:

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:AF:CA:C2  
          inet addr:169.254.13.18  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

Node 2:

eth1      Link encap:Ethernet  HWaddr 08:00:27:5B:E4:92  
          inet addr:192.168.57.102  Bcast:192.168.57.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe5b:e492/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:28778 errors:0 dropped:0 overruns:0 frame:0
          TX packets:52599 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:16441517 (15.6 MiB)  TX bytes:72160947 (68.8 MiB)

eth3      Link encap:Ethernet  HWaddr 08:00:27:ED:7D:A2         
          inet addr:169.254.1.100  Bcast:169.254.255.255  Mask:255.255.0.0
          inet6 addr: fe80::a00:27ff:feed:7da2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11725 errors:0 dropped:0 overruns:0 frame:0
          TX packets:78 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:10457605 (9.9 MiB)  TX bytes:13433 (13.1 KiB)

At this point the cause is clear: an address in the HAIP range (169.254.0.0/16) was already in use on the second node, which prevented HAIP from starting there. We decided to remove that interface configuration (it is not clear when or how this interface and address were added) and restart the clusterware on the node.

5 Remove the eth3 configuration on the second node and restart the clusterware on it; the problem is resolved.
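For reference, the cleanup might look roughly like the following (a sketch assuming a RHEL-style network setup; the ifcfg file path is an assumption, and crsctl must be run as root from the Grid home):

# find where the 169.254 address is persisted (assumed RHEL-style location)
[root@rac2 ~]# grep -l 169.254 /etc/sysconfig/network-scripts/ifcfg-*
# take the conflicting interface down and back up its configuration file
[root@rac2 ~]# ifdown eth3
[root@rac2 ~]# mv /etc/sysconfig/network-scripts/ifcfg-eth3 /root/ifcfg-eth3.bak
# restart the clusterware stack on this node
[root@rac2 ~]# crsctl stop crs -f
[root@rac2 ~]# crsctl start crs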
