Troubleshooting frequent Oracle RAC node restarts

Author: 袖梨 2022-06-29

This case comes from a user whose RAC had been restarting frequently: since September, a node had been restarting roughly every two days. Let's start with the logs.
Node 1's alert log:


Tue Oct 28 10:51:40 2014
Thread 1 advanced to log sequence 22792 (LGWR switch)
 Current log# 107 seq# 22792 mem# 0: +ORADATA/portaldb/redo107.log
Tue Oct 28 10:57:16 2014
Thread 1 advanced to log sequence 22793 (LGWR switch)
 Current log# 108 seq# 22793 mem# 0: +ORADATA/portaldb/redo108.log
Tue Oct 28 11:04:07 2014
Reconfiguration started (old inc 48, new inc 50)
List of instances:
 1 (myinst: 1)
Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Oct 28 11:04:09 2014
Tue Oct 28 11:04:09 2014
Next, let's look at node 1's CRS alert log:


2014-10-23 07:19:57.145
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.732 seconds
2014-10-23 07:20:05.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.708 seconds
2014-10-23 07:20:09.175
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.702 seconds
2014-10-23 07:20:11.880
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564863; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-23 09:58:11.620
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.141 seconds
2014-10-23 09:58:18.634
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.126 seconds
2014-10-23 09:58:23.660
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.100 seconds
2014-10-23 09:58:25.763
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564865; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-23 14:31:07.140
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.105 seconds
2014-10-23 14:31:14.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.075 seconds
2014-10-23 14:31:19.181
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.063 seconds
2014-10-23 14:31:21.246
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564867; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-25 06:02:39.191
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.748 seconds
2014-10-25 06:02:47.197
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.742 seconds
2014-10-25 06:02:51.203
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.736 seconds
2014-10-25 06:02:53.941
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564869; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-25 06:04:02.765
[crsd(6815946)]CRS-2772:Server 'xxdb2' has been assigned to pool 'ora.portaldb'.
2014-10-28 11:03:48.965
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.023 seconds
2014-10-28 11:03:55.990
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.998 seconds
2014-10-28 11:04:00.008
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.979 seconds
2014-10-28 11:04:02.988
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564871; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
2014-10-28 11:04:05.992
[cssd(6095264)]CRS-1625:Node xxdb2, number 2, was manually shut down
2014-10-28 11:04:05.998
........
[cssd(6095264)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxdb1 xxdb2 .
From node 1's CRS alert log we can see that node evictions occurred on the 23rd, the 25th, and the 28th. Judging from this information alone, the problem appears to be network related.
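Before digging into ocssd.log, it helps to build a quick timeline of these CRS-1612/1611/1610/1607 messages rather than scanning the file by hand. Below is a minimal Python sketch for doing so; it is an illustration rather than part of the original analysis, and the alert log path is an assumption based on the GRID_HOME visible in the excerpt (the standard 11.2 location is $GRID_HOME/log/<hostname>/alert<hostname>.log), so adjust it for your environment.


# Minimal sketch: pull the CRS-1612/1611/1610/1607 heartbeat/eviction messages
# out of the CRS alert log and group them by day to see how often node xxdb2
# was being evicted.
import re
from collections import defaultdict

ALERT_LOG = "/grid/product/11.2.0/log/xxdb1/alertxxdb1.log"   # assumed path
PATTERN = re.compile(r"CRS-16(07|10|11|12)")

events = defaultdict(list)
with open(ALERT_LOG, errors="replace") as f:
    ts = None
    for line in f:
        line = line.rstrip()
        # timestamp lines look like "2014-10-28 11:03:48.965"
        if re.match(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", line):
            ts = line
        elif PATTERN.search(line) and ts:
            events[ts[:10]].append((ts, line.split("]")[-1]))

for day, msgs in sorted(events.items()):
    print(f"== {day}: {len(msgs)} heartbeat/eviction messages ==")
    for ts, msg in msgs:
        print(f"  {ts}  {msg}")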
Node 1's ocssd.log is as follows:


2014-10-28 11:03:47.010: [    CSSD][1029]clssscSelect: cookie accept request 110987400
2014-10-28 11:03:47.010: [    CSSD][1029]clssgmAllocProc: (113c7b590) allocated
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: properties of cmProc 113c7b590 - 0,1,2,3,4
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: Connect from con(1d287ee) proc(113c7b590) pid(26345486) version 11:2:1:4, properties: 0,1,2,3,4
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: msg flags 0x0000
2014-10-28 11:03:47.061: [    CSSD][2577]clssnmSetupReadLease: status 1
........
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 50% heartbeat fatal, removal in 14.023 seconds
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) is impending reconfig, flag 2294796, misstime 15977
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-10-28 11:03:48.965: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917536, LATS 2041808730, lastSeqNo 41914523, uniqueness 1414188209, timestamp 1414465442/2128363897
......
2014-10-28 11:03:49.611: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917537, LATS 2041809376, lastSeqNo 41917536, uniqueness 1414188209, timestamp 1414465443/2128364399
2014-10-28 11:03:49.612: [    CSSD][2577]clssnmSetupReadLease: status 1
2014-10-28 11:03:49.617: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1
2014-10-28 11:03:49.617: [    CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
2014-10-28 11:03:49.619: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (113c9aad0), operation status 1, client status 0
2014-10-28 11:03:49.633: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 18 with status 1
2014-10-28 11:03:49.633: [    CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
2014-10-28 11:03:49.635: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 18 for proc (113c9aad0), operation status 1, client status 0
2014-10-28 11:03:49.671: [    CSSD][1029]clssnmGetNodeNumber: xxdb1
2014-10-28 11:03:49.725: [    CSSD][1029]clssnmGetNodeNumber: xxdb2
.......
2014-10-28 11:03:49.969: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917538, LATS 2041809734, lastSeqNo 41917537, uniqueness 1414188209, timestamp 1414465443/2128364901
.......
2014-10-28 11:03:50.970: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917540, LATS 2041810735, lastSeqNo 41917538, uniqueness 1414188209, timestamp 1414465444/2128365902
.......
2014-10-28 11:03:51.248: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2014-10-28 11:03:51.248: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:03:51.975: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917542, LATS 2041811740, lastSeqNo 41917540, uniqueness 1414188209, timestamp 1414465445/2128366906
.......
2014-10-28 11:04:00.007: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917558, LATS 2041819771, lastSeqNo 41917556, uniqueness 1414188209, timestamp 1414465453/2128374949
2014-10-28 11:04:00.008: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 90% heartbeat fatal, removal in 2.979 seconds, seedhbimpd 1
2014-10-28 11:04:01.010: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917560, LATS 2041820775, lastSeqNo 41917558, uniqueness 1414188209, timestamp 1414465454/2128375951
2014-10-28 11:04:02.012: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917562, LATS 2041821776, lastSeqNo 41917560, uniqueness 1414188209, timestamp 1414465455/2128376952
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmPollingThread: Removal started for node xxdb2 (2), flags 0x23040c, state 3, wt4c 0
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmMarkNodeForRemoval: node 2, xxdb2 marked for removal
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmDiscHelper: xxdb2, node(2) connection failed, endp (c1da79), probe(0), ninf->endp c1da79
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmDiscHelper: node 2 clean up, endp (c1da79), init state 5, cur state 5
2014-10-28 11:04:02.988: [GIPCXCPT][3605] gipcInternalDissociate: obj 11299f410 [0000000000c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2014-10-28 11:04:02.988: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 11299f410 [0000000000c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 }, flags 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 265564871
2014-10-28 11:04:02.988: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 1, from -1, changes 61
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 265564871
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564871), indicating EXADATA fence initialization complete
........
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmNeedConfReq: No configuration to change
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, xxdb2, misstime(30001) state(5)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2014-10-28 11:04:02.988: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 1, events 0xa0, state 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckSplit: Node 2, xxdb2, is alive, DHB (1414465455, 2128376952) more than disk timeout of 27000 after the last NHB (1414465426, 2128347958)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 1
2014-10-28 11:04:02.988: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 0, events 0x20, state 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmRemove: Start
2014-10-28 11:04:02.988: [    CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, xxdb2, from the cluster in incarnation 265564871, node birth incarnation 265564870, death incarnation 265564871, stateflags 0x234000 uniqueness value 1414188209
........
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmSendShutdown: req to node 2, kill time 2041822753
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmsendmsg: not connected to node 2
We can see that on the 28th, the clssnmPollingThread thread on node 1 reported that node 2's network heartbeat was being missed. Oracle Clusterware essentially relies on this thread to decide whether each cluster node's network heartbeat is healthy.

From the information above, the indication is that node 2 had lost its network heartbeat while its disk heartbeat was still being written ("has a disk HB, but no network HB").
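For reference, those 50%/75%/90% warnings are fractions of the CSS misscount. The roughly 30000 ms misstime value recorded when node 1 decided to terminate node 2 is consistent with the 11.2 default misscount of 30 seconds, which can be confirmed with crsctl get css misscount. The small Python sketch below is only an illustration of that arithmetic, not part of the original post:


# The 50%/75%/90% heartbeat warnings are fractions of the CSS misscount.
# Assuming the 11.2 default of 30 seconds (consistent with misstime(30001)
# in node 1's ocssd.log above):
MISSCOUNT = 30.0   # seconds; verify with `crsctl get css misscount`

for pct in (0.50, 0.75, 0.90):
    elapsed = MISSCOUNT * pct
    remaining = MISSCOUNT - elapsed
    print(f"{int(pct * 100)}% of timeout: ~{elapsed:.1f}s without network HB, "
          f"removal in ~{remaining:.1f}s")

# Prints 15.0/15.0, 22.5/7.5 and 27.0/3.0, which matches the 14.x, 6-7.x and
# 2.x second countdowns seen in the CRS alert log excerpts above.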
The focus now shifts to analyzing node 2's logs. First, node 2's alert log:


Tue Oct 28 10:59:40 2014
Thread 2 advanced to log sequence 26516 (LGWR switch)
 Current log# 208 seq# 26516 mem# 0: +ORADATA/portaldb/redo208.log
Tue Oct 28 11:04:18 2014
NOTE: ASMB terminating
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_23724130.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_23724130.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
ASMB (ospid: 23724130): terminating the instance due to error 15064
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (22806846) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (16187426) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (28049650) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
ORA-1092 : opitsk aborting process
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (7864418) as a result of ORA-1092
Instance terminated by ASMB, pid = 23724130
Tue Oct 28 11:05:45 2014
Starting ORACLE instance (normal)
........
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /oracle/product/11.2.0
System name: AIX
Node name: xxdb2
Release: 1
Version: 7
The database instance's alert log tells us little beyond the fact that ASMB terminated the instance after losing communication with the ASM instance (ORA-15064). Now let's look at the most critical piece of evidence, node 2's ocssd.log:


2014-10-28 11:03:58.792: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 50% heartbeat fatal, removal in 14.773 seconds
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) is impending reconfig, flag 2491406, misstime 15227
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-10-28 11:04:01.217: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932275, LATS 2128362398, lastSeqNo 42873784, uniqueness 1414000248, timestamp 1414465427/2041806941
........
2014-10-28 11:04:13.242: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932297, LATS 2128374423, lastSeqNo 42932295, uniqueness 1414000248, timestamp 1414465439/2041818993
2014-10-28 11:04:13.243: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 90% heartbeat fatal, removal in 2.746 seconds, seedhbimpd 1
2014-10-28 11:04:14.244: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932299, LATS 2128375425, lastSeqNo 42932297, uniqueness 1414000248, timestamp 1414465440/2041819996
2014-10-28 11:04:14.843: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2014-10-28 11:04:14.843: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:04:15.246: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932301, LATS 2128376426, lastSeqNo 42932299, uniqueness 1414000248, timestamp 1414465441/2041821006
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmPollingThread: Removal started for node xxdb1 (1), flags 0x26040e, state 3, wt4c 0
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmMarkNodeForRemoval: node 1, xxdb1 marked for removal
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmDiscHelper: xxdb1, node(1) connection failed, endp (90e), probe(0), ninf->endp 90e
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmDiscHelper: node 1 clean up, endp (90e), init state 5, cur state 5
2014-10-28 11:04:15.994: [GIPCXCPT][3605] gipcInternalDissociate: obj 111261650 [000000000000090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2014-10-28 11:04:15.994: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]  failed to dissociate obj 111261650 [000000000000090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 }, flags 0x0
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 265564871
2014-10-28 11:04:15.994: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress  val 2, from -1, changes 4
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 265564871
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE
2014-10-28 11:04:15.994: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932302, LATS 2128377175, lastSeqNo 42932301, uniqueness 1414000248, timestamp 1414465441/2041821506
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564871), indicating EXADATA fence initialization complete
.......
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, xxdb1, misstime(30004) state(5)
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (1414465441, 2041821506) more than disk timeout of 27000 after the last NHB (1414465412, 2041791963)
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(ocr_xxdb-scan) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0x78, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]
 
----- Call Stack Trace -----
........
2014-10-28 11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: CmInfo State  val 2, changes 13
2014-10-28 11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes  val 265564870, changes 5
2014-10-28 11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg(265564870)
2014-10-28 11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg(265564870)
2014-10-28 11:04:15.996: [    CSSD][1]clssgmStartNMMon:  completed node cleanup
2014-10-28 11:04:15.996: [    CSSD][4119]calling              call     entry                argument values in hex
2014-10-28 11:04:15.996: [    CSSD][4119]location             type     point                (? means dubious value)
2014-10-28 11:04:15.996: [    CSSD][3348]clssgmUpdateEventValue: HoldRequest  val 1, changes 3
2014-10-28 11:04:15.996: [    CSSD][4119]-------------------- -------- -------------------- ----------------------------
2014-10-28 11:04:15.997: [    CSSD][4119]clssscExit()+708     bl       10006db74            67800000678 ? 1000F33C8 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   11003ADF0 ? 90000000085C914 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmCheckDskInfo(  call     clssscExit()         111B08E30 ? 200000002 ?
2014-10-28 11:04:15.997: [    CSSD][4119])+1600                                             11003ADF0 ? 90000000085C914 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmDoSyncUpdate(  call     clssnmCheckDskInfo(  111B08E30 ? 110E3F130 ?
2014-10-28 11:04:15.997: [    CSSD][4119])+4016                        )                    000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   013A80187 ? 000000004 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmRcfgMgrThread  call     clssnmDoSyncUpdate(  111B08E30 ? 110E3F130 ?
2014-10-28 11:04:15.997: [    CSSD][4119]()+2992                       )                    000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   1128045D0 ? 100000001 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssscthrdmain()+20  call     clssnmRcfgMgrThread  111B08E30 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]8                             ()                   000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]_pthread_body()+240  call     clssscthrdmain()     111B08E30 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]fffffffffffffffc     call     _pthread_body()      000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]                                                   000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]
.......
----- End of Call Stack Trace -----
 
2014-10-28 11:04:16.052: [    CSSD][4119]clssnmSendMeltdownStatus: node xxdb2, number 2, has experienced a failure in thread number 3 and is shutting down
2014-10-28 11:04:16.052: [    CSSD][4119]clssscExit: Starting CRSD cleanup
2014-10-28 11:04:16.052: [    CSSD][2320]clssnmvDiskKillCheck: not evicted, file /dev/rhdisk22 flags 0x00000000, kill block unique 0, my unique 1414188209
.......
2014-10-28 11:04:16.995: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932304, LATS 2128378175, lastSeqNo 42932302, uniqueness 1414000248, timestamp 1414465442/2041822515
2014-10-28 11:04:17.004: [    CSSD][3605]clssnmPollingThread: state(3) clusterState(2) exit
2014-10-28 11:04:17.004: [    CSSD][3605]clssscExit: abort already set 1
2014-10-28 11:04:17.053: [    CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync 265564871, stamp 2041822753
2014-10-28 11:04:17.053: [    CSSD][2320]clssscExit: abort already set 1
........
2014-10-28 11:04:17.995: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932306, LATS 2128379175, lastSeqNo 42932304, uniqueness 1414000248, timestamp 1414465443/2041823528
........
2014-10-28 11:04:18.600: [    CSSD][4119]clssscExit: CRSD cleanup successfully completed
2014-10-28 11:04:18.602: [ default][4119]kgzf_gen_node_reid2: generated reid cid=2cca484453d04f96bf8ea33b240c4c97,icin=265564851,nmn=2,lnid=265564870,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100
 
2014-10-28 11:04:18.602: [    CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
2014-10-28 11:04:18.602: [    CSSD][4119]clssgmUpdateEventValue: CmInfo State  val 0, changes 14
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking RPC Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking dead client Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking dead proc Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: Aborting client (1114fb0f0) proc (111dceb10), iocapables 1.
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: I/O capable proc (111dceb10), pid (19595302), iocapables 1, client (1114fb0f0)
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: Aborting client (11152a530) proc (111d29b50), iocapables 2.

Filtering node 2's ocssd.log down to the key entries, we find the following:
October 25th:


2014-10-25 06:03:07.804: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564869, wrtcnt, 42656043, LATS 1851108984, lastSeqNo 42656042, uniqueness 1414000248, timestamp 1414188173/1764553533
.......
2014-10-25 06:03:07.804: [    CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync 265564869, stamp 1764553706
2014-10-25 06:03:07.804: [    CSSD][2320]###################################
2014-10-25 06:03:07.804: [    CSSD][4376]clssnmHandleSync: Node xxdb2, number 2, is EXADATA fence capable
2014-10-25 06:03:07.804: [    CSSD][2320]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2014-10-25 06:03:07.804: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564869)
2014-10-25 06:03:07.804: [    CSSD][2320]###################################
October 28th:


2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (1414465441, 2041821506) more than disk timeout of 27000 after the last NHB (1414465412, 2041791963)
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
Looking purely at the data from the 25th and the 28th, these two failures are actually different. From the key information above we can see that on the 25th the CSSD abort came from the clssnmvKillBlockThread thread, whereas on the 28th it came from clssnmRcfgMgrThread.
These two threads do completely different jobs: the first handles voting-disk operations (it reads the kill block and found that node 2 had been evicted by node 1), while the second is the reconfiguration manager thread that reacts to the loss of the network heartbeat.
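To compare evictions across days without reading the whole ocssd.log by hand, the aborting thread can be pulled out of each clssscExit message. The Python sketch below is only an illustration of that idea; the log path is inferred from the excerpt above and may differ on your system.


# Minimal sketch: report which CSSD thread triggered each abort in ocssd.log
# (clssnmvKillBlockThread = eviction via the voting-disk kill block,
#  clssnmRcfgMgrThread    = reconfiguration after the network heartbeat was lost).
import re

OCSSD_LOG = "/grid/product/11.2.0/log/xxdb2/cssd/ocssd.log"   # assumed path
abort_re = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*"
    r"clssscExit: CSSD aborting from thread (\S+)"
)

with open(OCSSD_LOG, errors="replace") as f:
    for line in f:
        m = abort_re.search(line)
        if m:
            ts, thread = m.groups()
            print(f"{ts}  aborted from {thread}")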
Finally, we asked the user what had changed recently; reportedly, the problem started right after the network switch was replaced.
I hope this small case is useful to others!
