Oracle RAC node crash due to ORA-00481

Author: 袖梨  2022-06-29

This is a case from one of our customers, shared here. At around 03:44 on 2015-01-13, node 1 of the XXXX cluster database crashed.
Analyzing the alert log of node XXXX1, we found the following:

Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lmon_10682988.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
USER: terminating instance due to error 481
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lms0_27525728.trc:
ORA-00481: LMON process terminated with error
....... (portion omitted)
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lms1_27001440.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
System state dump is made for local instance
System State dumped to trace file /home/oracle/app/admin/XXXX/bdump/XXXX1_diag_28246956.trc
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lmd0_27198128.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_mman_28378004.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:43 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lck0_25952674.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:44 2015
Errors in file /home/oracle/app/admin/XXXX/bdump/XXXX1_lgwr_33489026.trc:
ORA-00481: LMON process terminated with error
Tue Jan 13 03:44:44 2015
Doing block recovery for file 94 block 613368
Tue Jan 13 03:44:45 2015
Shutting down instance (abort)
License high water mark = 1023
Tue Jan 13 03:44:49 2015
Instance terminated by USER, pid = 19333184
Tue Jan 13 03:44:55 2015
Instance terminated by USER, pid = 33554510
Tue Jan 13 03:45:46 2015
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name

From the log above, at 03:44:43 the LMON process on node 1 terminated abnormally, raising ORA-00481, after which the database instance on node 1 was forcibly terminated.
The main role of Oracle's LMON process is to monitor the GES (Global Enqueue Service) state of the RAC cluster, but it is not limited to that: it also checks the health of every node in the cluster, and when a node fails it drives the reconfiguration and the recovery of the GRD (Global Resource Directory). Regarding RAC's split-brain handling: when IO fencing is performed by Oracle itself, that is, by the Clusterware, LMON notifies the Clusterware when it detects an instance-level split brain, but it does not wait indefinitely for the Clusterware's result. Once the wait exceeds a certain time, LMON triggers IMR (Instance Membership Recovery) on its own, which is what we usually call instance membership reconfiguration.
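As a side note (not part of the original case data), the lock-management background processes discussed here can be listed on a running instance; a minimal sketch using the standard v$bgprocess view:

```sql
-- List the RAC lock-management background processes on the local instance
-- (LMON, LMD0, LMSn, LCK0). PADDR <> '00' filters out background
-- processes that are defined but not currently running.
SELECT name, description
FROM   v$bgprocess
WHERE  (name LIKE 'LM%' OR name = 'LCK0')
AND    paddr <> '00';
```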
From the log analysis above, we can see that the node 1 instance was forcibly terminated by the LMON process, and LMON took this action because it had itself run into a problem. So why did LMON on node 1 fail? The trace file of the LMON process on the node 1 instance contains the following:

*** 2015-01-13 03:44:18.067
kjfcdrmrfg: SYNC TIMEOUT (1295766, 1294865, 900), step 31
Submitting asynchronized dump request [28]
KJC Communication Dump:
 state 0x5  flags 0x0  mode 0x0  inst 0  inc 68
 nrcv 17  nsp 17  nrcvbuf 1000
 reg_msg: sz 456  cur 1235 (s:0 i:1235) max 5251  ini 3750
 big_msg: sz 8240  cur 263 (s:0 i:263) max 1409  ini 1934
 rsv_msg: sz 8240  cur 0 (s:0 i:0) max 0  tot 1000
 rcvr: id 1  orapid 7  ospid 27525728
 rcvr: id 9  orapid 15  ospid 26149404
 .......
 .......
 rcvr: id 7  orapid 13  ospid 17105074
 rcvr: id 16  orapid 22  ospid 29033450
 send proxy: id 1  ndst 1 (1:1 )
 send proxy: id 9  ndst 1 (1:9 )
 .......
 .......
 send proxy: id 7  ndst 1 (1:7 )
 send proxy: id 16  ndst 1 (1:16 )
GES resource limits:
 ges resources: cur 0 max 0 ini 39515
 ges enqueues: cur 0 max 0 ini 59069
 ges cresources: cur 4235 max 7721
 gcs resources: cur 4405442 max 5727836 ini 7060267
 gcs shadows: cur 4934515 max 6358617 ini 7060267
KJCTS state: seq-check:no  timeout:yes  waitticks:0x3  highload no
GES destination context:
GES remote instance per receiver context:
GES destination context:
.......
kjctseventdump-end tail 238 heads 0 @ 0 238 @ -744124571
sync() timed out - lmon exiting
kjfsprn: sync status  inst 0  tmout 900 (sec)
kjfsprn: sync propose inc 68  level 85020
kjfsprn: sync inc 68  level 85020
kjfsprn: sync bitmap 0 1
kjfsprn: dmap ver 68 (step 0)
.......
DUMP state for lmd0 (ospid 27198128)
DUMP IPC context for lmd0 (ospid 27198128)
Dumping process 6.27198128 info:

From the LMON trace above, LMON detected that a DRM sync operation had timed out, and it finally forced itself to exit. So we need to work out why the DRM timed out. Since the main worker process of DRM is LMD, let us look at the trace of the LMD process of the node 1 instance:

*** 2015-01-13 03:44:43.666
lmd abort after exception 481
KJC Communication Dump:
 state 0x5  flags 0x0  mode 0x0  inst 0  inc 68
 nrcv 17  nsp 17  nrcvbuf 1000
 reg_msg: sz 456  cur 1189 (s:0 i:1189) max 5251  ini 3750
 big_msg: sz 8240  cur 261 (s:0 i:261) max 1409  ini 1934
 rsv_msg: sz 8240  cur 0 (s:0 i:0) max 0  tot 1000
 rcvr: id 1  orapid 7  ospid 27525728
 .......
 rcvr: id 7  orapid 13  ospid 17105074
 rcvr: id 16  orapid 22  ospid 29033450
 send proxy: id 1  ndst 1 (1:1 )
 send proxy: id 9  ndst 1 (1:9 )
 .......
 send proxy: id 7  ndst 1 (1:7 )
 send proxy: id 16  ndst 1 (1:16 )
GES resource limits:
 ges resources: cur 0 max 0 ini 39515
 ges enqueues: cur 0 max 0 ini 59069
 ges cresources: cur 4235 max 7721
 gcs resources: cur 4405442 max 5727836 ini 7060267
 gcs shadows: cur 4934515 max 6358617 ini 7060267
KJCTS state: seq-check:no  timeout:yes  waitticks:0x3  highload no
GES destination context:
GES remote instance per receiver context:
GES destination context:

We can see that once the LMON process hit ORA-00481, the LMD process also aborted. Before LMON was forcibly terminated, it triggered a process dump, as follows:

*** 2015-01-13 03:44:18.114
Dump requested by process [orapid=5]
REQUEST:custom dump [2] with parameters [5][6][0][0]
Dumping process info of pid[6.27198128] requested by pid[5.10682988]
Dumping process 6.27198128 info:
*** 2015-01-13 03:44:18.115
Dumping diagnostic information for ospid 27198128:
OS pid = 27198128
loadavg : 1.71 1.75 2.33
swap info: free_mem = 13497.62M rsv = 96.00M
alloc = 342.91M avail = 24576.00M swap_free = 24233.09M
 F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
 240001 A   oracle 19530440 10682988  10  65 20 16ae3ea590  1916          03:44:18      -  0:00 /usr/bin/procstack 27198128
 242001 T   oracle 27198128        1   1  60 20 7412f4590 108540            Dec 29      - 569:20 ora_lmd0_XXXX1
procstack: open(/proc/27198128/ctl): Device busy
*** 2015-01-13 03:44:18.420
From the analysis above, the ORA-00481 error is the key, and it was raised by the LMON process.
For ORA-00481, according to Oracle MOS note 1950963.1, there are usually the following possible causes:
1) the instance cannot obtain LE (Lock Element) locks;
2) insufficient tickets in the RAC flow-control mechanism.
Per the note, we searched the LMS process traces on both database nodes and did not find keyword lines such as:
Start affinity expansion for pkey 81885.0
Expand failed: pkey 81885.0, 229 shadows traversed, 153 replayed 1 retries
We can therefore rule out the first cause. As for the second, the LMD process trace contains information like the following:

GES destination context:
Dest 1  rcvr 0  inc 68  state 0x10041  tstate 0x0
 batch-type quick  bmsg 0x0  tmout 0x20f0dd31  msg_in_batch 0
tkt total 1000  avl 743 sp_rsv 242 max_sp_rsv 250
 seq wrp 0  lst 268971339  ack 0  snt 268971336
 sync seq 0.268971339  inc 0  sndq enq seq 0.268971339
 batch snds 546480  tot msgs 5070830  max sz 88  fullload 85  snd seq 546480
 pbatch snds 219682271  tot msgs 267610831
 sndq msg tot 225339578  tm (0 17706)
 sndq msg 0  maxlmt 7060267  maxlen 149  wqlen 225994573
 sndq msg 0  start_tm 0  end_tm 0

From the "tkt total 1000  avl 743" line above, we can see that tickets were sufficient, so the second cause can be ruled out as well. In other words, this ORA-00481 was not caused by an Oracle RAC configuration problem.
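The ticket usage shown in the trace can also be checked on a live instance, without waiting for a dump; a minimal sketch using the v$ges_traffic_controller view (column names as documented for this view):

```sql
-- GES flow-control ticket usage per remote instance.
-- TCKT_WAIT = 'YES' would indicate senders waiting for tickets,
-- i.e. possible ticket exhaustion.
SELECT local_nid, remote_nid, tckt_avail, tckt_limit, tckt_wait
FROM   v$ges_traffic_controller;
```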
LMON detected a DRM timeout, which finally crashed the instance. The usual causes of such a timeout are:
1) extremely high OS load, e.g. the CPU is so busy that processes cannot get CPU time;
2) a problem in the process itself, e.g. a process hang;
3) a network problem, e.g. abnormal communication between the database nodes;
4) imperfections in the DRM mechanism itself;
5) an Oracle DRM bug.
From the information above, the OS load was very low when the problem occurred, so the first cause can be ruled out immediately.
Our goal now is to find out what anomaly LMON detected and why it occurred. Since the LMD process dumped itself before aborting, we can look for clues in that dump:

PROCESS 5:
 ----------------------------------------
 SO: 700001406331850, type: 2, owner: 0, flag: INIT/-/-/0x00
 (process) Oracle pid=5, calls cur/top: 7000014054d75e0/7000014054d75e0, flag: (6) SYSTEM
 int error: 0, call error: 0, sess error: 0, txn error 0
 (post info) last post received: 0 0 24
 last post received-location: ksasnd
 last process to post me: 700001405330198 1 6
 last post sent: 0 0 24
 last post sent-location: ksasnd
 last process posted by me: 7000014023045d0 1 2
 (latch info) wait_event=0 bits=0
 Process Group: DEFAULT, pseudo proc: 70000140336dd88
 O/S info: user: oracle, term: UNKNOWN, ospid: 10682988
OSD pid info: Unix process pid: 10682988, image: oracle@tpihxdb1 (LMON)
Dump of memory from 0x07000014022E9320 to 0x07000014022E9528
......
7000014022E9520 00000000 00000000                    [........]
(FOB) flags=67 fib=7000013c30cedf0 incno=0 pending i/o cnt=0
 fname=/oradata2/XXXX/control03.ctl
 fno=2 lblksz=16384 fsiz=1626
 (FOB) flags=67 fib=7000013c30cea50 incno=0 pending i/o cnt=0
 fname=/oradata1/XXXX/control02.ctl
 fno=1 lblksz=16384 fsiz=1626
 (FOB) flags=67 fib=7000013c30ce6b0 incno=0 pending i/o cnt=0
 fname=/oradata1/XXXX/control01.ctl
 fno=0 lblksz=16384 fsiz=1626
 ----------------------------------------
 SO: 7000014036e36a8, type: 19, owner: 700001406331850, flag: INIT/-/-/0x00
 GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=68
 outq=0 sndq=0 opid=5 prmb=0x0
mbg[i]=(158041 85020) mbg[b]=(3534178 0) mbg[r]=(0 0)
 fmq[i]=(20 1) fmq[b]=(5 0) fmq[r]=(0 0)
 mop[s]=1 mop[q]=3692218 pendq=0 zmbq=0
 nonksxp_recvs=0
 ------------process 0x7000014036e36a8--------------------
 proc version      : 0
 Local node        : 0
 pid               : 10682988
 lkp_node          : 0
 svr_mode          : 0
 proc state        : KJP_NORMAL
 Last drm hb acked : 0
 Total accesses    : 2
 Imm.  accesses    : 1
 Locks on ASTQ     : 0
 Locks Pending AST : 0
 Granted locks     : 0
 AST_Q:
PENDING_Q:
GRANTED_Q:
----------------------------------------
 SO: 7000014036efd68, type: 19, owner: 7000014036e36a8, flag: INIT/-/-/0x00
 ----------------------------------------
 SO: 70000136e79f658, type: 18, owner: 7000014036efd68, flag: INIT/-/-/0x00
 ----------enqueue 0x70000136e79f658------------------------
 lock version     : 1
 Owner node       : 1
 grant_level      : KJUSEREX
 req_level        : KJUSEREX
 bast_level       : KJUSEREX
 notify_func      : 0
 resp             : 70000140c6d9d40
 procp            : 7000014036efd68
 pid              : 0
 proc version     : 0
 oprocp           : 0
 opid             : 0
 group lock owner : 0
 xid              : 0000-0000-00000000
 dd_time          : 0.0 secs
 dd_count         : 0
 timeout          : 0.0 secs
 On_timer_q?      : N
 On_dd_q?         : N
 lock_state       : GRANTED
 Open Options     :
Convert options  :
History          : 0x9c8d
 Msg_Seq          : 0x1
 res_seq          : 2
 valblk           : 0x00000000000000000000000000000000 .
 ----------resource 0x70000140c6d9d40----------------------
 resname       : [0x19][0x2],[RS]
 Local node    : 0
 dir_node      : 0
 master_node   : 0
 hv idx        : 122
 hv last r.inc : 68
 current inc   : 68
 hv status     : 0
 hv master     : 0
 open options  :
grant_bits    : KJUSERNL KJUSEREX
grant mode    : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
 count         : 1         0         0         0         0         1
 val_state     : KJUSERVS_DUBVALUE
 valblk        : 0x00000000000000000000000000000000 .
 access_node   : 1
 vbreq_state   : 0
 state         : x0
 resp          : 70000140c6d9d40
 On Scan_q?    : N
 Total accesses: 132825
 Imm.  accesses: 123137
 Granted_locks : 1
Cvting_locks  : 1
value_block:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 GRANTED_Q :
 lp 70000136e79f658 gl KJUSEREX rp 70000140c6d9d40 [0x19][0x2],[RS]
 master 0 owner 1  bast 0 rseq 2 mseq 0x1 history 0x9c8d
 open opt
CONVERT_Q:
lp 700001403a3d0c0 gl KJUSERNL rl KJUSEREX rp 70000140c6d9d40 [0x19][0x2],[RS]
 master 0 pid 25428644 bast 0 rseq 3 mseq 0 history 0x9a
 convert opt KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
Rdomain number is 0
 ----------------------------------------
 SO: 7000014064d88c8, type: 4, owner: 700001406331850, flag: INIT/-/-/0x00
 (session) sid: 1652 trans: 0, creator: 700001406331850, flag: (51) USR/- BSY/-/-/-/-/-
 DID: 0001-0005-00000006, short-term DID: 0000-0000-00000000
 txn branch: 0
 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
 service name: SYS$BACKGROUND
 last wait for 'ges generic event' blocking sess=0x0 seq=35081 wait_time=158 seconds since wait started=0
 =0, =0, =0
 Dumping Session Wait History
 for 'ges generic event' count=1 wait_time=158
 =0, =0, =0
 .......
 ----------------------------------------
 SO: 7000013c2f52018, type: 41, owner: 7000014064d88c8, flag: INIT/-/-/0x00
 (dummy) nxc=0, nlb=0
----------------------------------------
 SO: 7000014035cb1e8, type: 11, owner: 700001406331850, flag: INIT/-/-/0x00
 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 700001406331850,
 event: 5, last message event: 70,
 last message waited event: 70,                        next message: 0(0), messages read: 1
 channel: (7000014074b6298) system events broadcast channel
 scope: 2, event: 129420, last mesage event: 70,
 publishers/subscribers: 0/915,
 messages published: 1
 ----------------------------------------
 SO: 7000014054d75e0, type: 3, owner: 700001406331850, flag: INIT/-/-/0x00
 (call) sess: cur 7000014064d88c8, rec 0, usr 7000014064d88c8; depth: 0
 ----------------------------------------
 SO: 7000014036e3060, type: 16, owner: 700001406331850, flag: INIT/-/-/0x00
 (osp req holder)
From LMON's own dump above, the LMON process of the node 1 instance was in a normal state; the last process to post a message to LMON was SO: 700001405330198. Searching for that SO, we find it is the LMD process:

PROCESS 6:
 ----------------------------------------
 SO: 700001405330198, type: 2, owner: 0, flag: INIT/-/-/0x00
 (process) Oracle pid=6, calls cur/top: 7000014054d78a0/7000014054d78a0, flag: (6) SYSTEM
 int error: 0, call error: 0, sess error: 0, txn error 0
 (post info) last post received: 0 0 104
 last post received-location: kjmpost: post lmd
 last process to post me: 700001402307510 1 6
 last post sent: 0 0 24
 last post sent-location: ksasnd
 last process posted by me: 700001407305690 1 6
 (latch info) wait_event=0 bits=0
 Process Group: DEFAULT, pseudo proc: 70000140336dd88
 O/S info: user: oracle, term: UNKNOWN, ospid: 27198128 (DEAD)
 OSD pid info: Unix process pid: 27198128, image: oracle@tpihxdb1 (LMD0)
Dump of memory from 0x07000014072E8460 to 0x07000014072E8668
7000014072E8460 00000007 00000000 07000014 036E30E8  [.............n0.]
......
7000014072E8660 00000000 00000000                    [........]
----------------------------------------
 SO: 7000014008a8d00, type: 20, owner: 700001405330198, flag: -/-/-/0x00
namespace [KSXP] key   = [ 32 31 30 47 45 53 52 30 30 30 00 ]
 ----------------------------------------
 SO: 7000014074a73d8, type: 4, owner: 700001405330198, flag: INIT/-/-/0x00
 (session) sid: 1651 trans: 0, creator: 700001405330198, flag: (51) USR/- BSY/-/-/-/-/-
 DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
 txn branch: 0
 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
 service name: SYS$BACKGROUND
 last wait for 'ges remote message' blocking sess=0x0 seq=25909 wait_time=163023 seconds since wait started=62
 waittime=40, loop=0, p3=0
 Dumping Session Wait History
 for 'ges remote message' count=1 wait_time=163023
 waittime=40, loop=0, p3=0
 .......
 ----------------------------------------
 SO: 7000013c2f51f28, type: 41, owner: 7000014074a73d8, flag: INIT/-/-/0x00
 (dummy) nxc=0, nlb=0
----------------------------------------
 SO: 7000014035cb2f8, type: 11, owner: 700001405330198, flag: INIT/-/-/0x00
 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 700001405330198,
 event: 6, last message event: 70,
 last message waited event: 70,                        next message: 0(0), messages read: 1
 channel: (7000014074b6298) system events broadcast channel
 scope: 2, event: 129420, last mesage event: 70,
 publishers/subscribers: 0/915,
 messages published: 1
 ----------------------------------------
 SO: 7000014036e3ba0, type: 19, owner: 700001405330198, flag: INIT/-/-/0x00
 GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=68
 outq=0 sndq=1 opid=6 prmb=0x0
mbg[i]=(0 55) mbg[b]=(11085 217185741) mbg[r]=(0 0)
 fmq[i]=(30 14) fmq[b]=(20 5) fmq[r]=(0 0)
 mop[s]=224759292 mop[q]=216634842 pendq=0 zmbq=0
 nonksxp_recvs=0
 ------------process 0x7000014036e3ba0--------------------
 proc version      : 0
 Local node        : 0
 pid               : 27198128
 lkp_node          : 0
 svr_mode          : 0
 proc state        : KJP_NORMAL
 Last drm hb acked : 0
 Total accesses    : 31515
 Imm.  accesses    : 31478
 Locks on ASTQ     : 0
 Locks Pending AST : 0
 Granted locks     : 2
 AST_Q:
PENDING_Q:
GRANTED_Q:
KJM HIST LMD0:
7:0 6:1 10:31:0 9:31:3 11:1 15:1 12:78181 7:0 6:0 10:31:1
9:31:2 11:1 15:1 12:78106 7:1 6:0 10:31:0 9:31:2 11:2 15:0
12:78805 7:0 6:0 10:31:1 9:31:2 11:1 15:1 12:78194 7:0 6:0
10:31:1 9:31:2 11:1 15:0 12:78177 7:0 6:0 10:31:1 9:31:2 11:1
15:0 12:78176 7:0 6:1 10:31:0 9:31:2 11:1 15:1 12:78890 7:1
6:0 10:31:0 9:31:2 11:2 15:0 12:78177 7:0 6:1 10:31:0 9:31:3
11:1 15:0 12:78180 7:0
DEFER MSG QUEUE ON LMD0 IS EMPTY
 SEQUENCES:
0:0.0  1:283096258.0
 ----------------------------------------
 SO: 7000014054d78a0, type: 3, owner: 700001405330198, flag: INIT/-/-/0x00
 (call) sess: cur 7000014074a73d8, rec 0, usr 7000014074a73d8; depth: 0
 ----------------------------------------
 SO: 7000014036e30e8, type: 16, owner: 700001405330198, flag: INIT/-/-/0x00
 (osp req holder)
From the LMD process's own dump, nothing looks abnormal. Judging from both the LMON and LMD process dumps, the processes themselves were in a normal state, so we can rule out a process hang as the cause of the timeout.
We can see that the LMD process kept waiting on 'ges remote message', which clearly means it was communicating with the other database node; so to find the root cause we also need to analyze the node 2 instance.
First, the alert log of the node 2 instance:

Tue Jan 13 02:26:13 2015
Thread 2 advanced to log sequence 47410 (LGWR switch)
 Current log# 7 seq# 47410 mem# 0: /redolog/XXXX/redo0701.log
 Current log# 7 seq# 47410 mem# 1: /newredolog/XXXX/redo0703.log
Tue Jan 13 03:39:14 2015
Timed out trying to start process PZ96.
Tue Jan 13 03:44:44 2015
Trace dumping is performing id=[cdmp_20150113034443]
Tue Jan 13 03:44:48 2015
Reconfiguration started (old inc 68, new inc 70)
List of nodes:
 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Jan 13 03:44:48 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed
......
 LMS 5: 0 GCS shadows cancelled, 0 closed
Tue Jan 13 03:44:48 2015
 LMS 9: 3 GCS shadows cancelled, 0 closed
 Set master node info
Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Tue Jan 13 03:44:49 2015
Instance recovery: looking for dead threads
Tue Jan 13 03:44:49 2015
Beginning instance recovery of 1 threads
Tue Jan 13 03:44:50 2015
 LMS 6: 282848 GCS shadows traversed, 0 replayed
Tue Jan 13 03:44:50 2015
 LMS 7: 284544 GCS shadows traversed, 0 replayed
.......
Tue Jan 13 03:44:51 2015
 LMS 10: 283658 GCS shadows traversed, 0 replayed
Tue Jan 13 03:44:51 2015
 LMS 11: 282777 GCS shadows traversed, 0 replayed
Tue Jan 13 03:44:51 2015
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Tue Jan 13 03:44:54 2015
 parallel recovery started with 16 processes
Tue Jan 13 03:44:55 2015
Started redo scan
Tue Jan 13 03:44:55 2015
Completed redo scan
 281591 redo blocks read, 4288 data blocks need recovery
Tue Jan 13 03:44:56 2015
Started redo application at
 Thread 1: logseq 47935, block 1974207
Tue Jan 13 03:44:56 2015
Recovery of Online Redo Log: Thread 1 Group 4 Seq 47935 Reading mem 0
 Mem# 0: /redolog/XXXX/redo0401.log
 Mem# 1: /newredolog/XXXX/redo0403.log
Tue Jan 13 03:44:56 2015
Recovery of Online Redo Log: Thread 1 Group 5 Seq 47936 Reading mem 0
 Mem# 0: /redolog/XXXX/redo0501.log
 Mem# 1: /newredolog/XXXX/redo0503.log
Tue Jan 13 03:44:57 2015
Completed redo application
Tue Jan 13 03:44:57 2015
Completed instance recovery at
 Thread 1: logseq 47936, block 270263, scn 6869096106270
 4253 data blocks read, 4901 data blocks written, 281591 redo blocks read
Tue Jan 13 03:44:57 2015
Thread 1 advanced to log sequence 47937 (thread recovery)
Tue Jan 13 03:44:57 2015
Redo thread 1 internally disabled at seq 47937 (SMON)
Tue Jan 13 03:44:58 2015
ARC1: Archiving disabled thread 1 sequence 47937
Tue Jan 13 03:44:59 2015
Thread 2 advanced to log sequence 47411 (LGWR switch)
 Current log# 8 seq# 47411 mem# 0: /redolog/XXXX/redo0801.log
 Current log# 8 seq# 47411 mem# 1: /newredolog/XXXX/redo0803.log
Tue Jan 13 03:45:09 2015
SMON: Parallel transaction recovery tried
Tue Jan 13 03:45:53 2015
Reconfiguration started (old inc 70, new inc 72)
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 0
Tue Jan 13 03:45:53 2015
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Jan 13 03:45:53 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed
From node 2's alert log, instance reconfiguration started at 03:44:48, which matches the timeline of the incident. The alert log itself does not tell us much, so next we analyze the LMON trace of the node 2 instance:

*** 2015-01-13 03:18:53.006
Begin DRM(82933)
 sent syncr inc 68 lvl 84937 to 0 (68,0/31/0)
synca inc 68 lvl 84937 rcvd (68.0)
sent syncr inc 68 lvl 84938 to 0 (68,0/34/0)
......
sent syncr inc 68 lvl 84968 to 0 (68,0/38/0)
synca inc 68 lvl 84968 rcvd (68.0)
End DRM(82933)
*** 2015-01-13 03:23:55.896
Begin DRM(82934)
 sent syncr inc 68 lvl 84969 to 0 (68,0/31/0)
synca inc 68 lvl 84969 rcvd (68.0)
......
sent syncr inc 68 lvl 85000 to 0 (68,0/38/0)
synca inc 68 lvl 85000 rcvd (68.0)
End DRM(82934)
*** 2015-01-13 03:29:00.374
Begin DRM(82935)
 sent syncr inc 68 lvl 85001 to 0 (68,0/31/0)
synca inc 68 lvl 85001 rcvd (68.0)
......
 sent syncr inc 68 lvl 85011 to 0 (68,0/36/0)
synca inc 68 lvl 85011 rcvd (68.0)
*** 2015-01-13 03:29:10.511
 sent syncr inc 68 lvl 85012 to 0 (68,0/38/0)
synca inc 68 lvl 85012 rcvd (68.0)
......
sent syncr inc 68 lvl 85020 to 0 (68,0/38/0)
synca inc 68 lvl 85020 rcvd (68.0)
*** 2015-01-13 03:44:45.191
kjxgmpoll reconfig bitmap: 1
*** 2015-01-13 03:44:45.191
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 68 0.
*** 2015-01-13 03:44:45.222
 Name Service frozen
kjxgmcs: Setting state to 68 1.
kjxgfipccb: msg 0x110fffe78, mbo 0x110fffe70, type 22, ack 0, ref 0, stat 34
kjxgfipccb: Send cancelled, stat 34 inst 0, type 22, tkt (1416,80)
kjxgfipccb: msg 0x110ffa0b8, mbo 0x110ffa0b0, type 22, ack 0, ref 0, stat 34
kjxgfipccb: Send cancelled, stat 34 inst 0, type 22, tkt (944,80)
kjxgfipccb: msg 0x11113be68, mbo 0x11113be60, type 22, ack 0, ref 0, stat 34
kjxgfipccb: Send cancelled, stat 34 inst 0, type 22, tkt (472,80)
kjxgrssvote: reconfig bitmap chksum 0xd7682cca cnt 1 master 1 ret 0
kjxggpoll: change poll time to 50 ms
* kjfcln: DRM aborted due to CGS rcfg.
*** 2015-01-13 03:44:45.281
From the LMON trace above, there had been a large amount of DRM activity before the failure. Two details are key. First, when the node started the reconfiguration, the reason code was 1. Oracle Metalink documents describe the reason values as follows:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

Since reason = 1, the forced termination and restart of the instance was not caused by a communication failure either; a communication problem would normally give reason = 3. reason = 1 indicates that the reconfiguration was triggered by the node's own monitoring.
At the same time, the line '* kjfcln: DRM aborted due to CGS rcfg.' confirms that the CGS reconfiguration was indeed caused by the failed DRM operation. We can also see that the DRM that began at 03:29, Begin DRM(82935), had still not finished when the failure occurred at 03:44 (had it finished, a matching 'End DRM(82935)' line would appear).
It is therefore not hard to see that the cluster database had in fact probably been in trouble since 03:29. Here is a brief recap of how Oracle DRM works:
In an Oracle RAC environment, when a node accesses a resource frequently while the master of that resource is not the local node, a DRM (Dynamic Resource Management) operation may be triggered. Before this feature was introduced in Oracle 10gR1, changing the master node of a resource required restarting the database instance; the new feature clearly changed that. From Oracle 10gR2 onward, object/undo-level affinity was added. 'Affinity' here borrows the operating-system notion of how strongly a process is bound to an object; for the database, it means how frequently an object is accessed.
In Oracle 10gR2, by default, when an object is accessed more than 50 times while its master is another node, Oracle triggers a DRM operation. During a DRM operation, Oracle temporarily freezes the relevant resource information, then unfreezes the resource on the other nodes and changes the resource's master node. Note that what is frozen here is the resource in the GRD (Global Resource Directory). While the DRM is in progress, processes accessing the resource are temporarily suspended, so a DRM can easily cause system or process anomalies.
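For reference, the per-object access counts that feed this affinity decision can be sampled on a 10gR2 system; a hedged sketch — x$object_affinity_statistics is an undocumented fixed table (queryable only as SYS) whose name and columns vary across versions:

```sql
-- Objects with high open counts on the local node are candidates for
-- DRM remastering. X$ fixed tables are undocumented and
-- version-dependent; run as SYS and treat the output as indicative only.
SELECT object, node, opens
FROM   x$object_affinity_statistics
ORDER  BY opens DESC;
```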
According to Oracle's documentation, when DRM is triggered too frequently, it is very likely to cause SYNC timeouts, as follows:


Bug 6960699 - "latch: cache buffers chains" contention/ORA-481/kjfcdrmrfg: SYNC TIMEOUT/ OERI[kjbldrmrpst:!master] (ID 6960699.8)
 
Dynamic ReMastering (DRM) can be too aggressive at times causing any combination
of the following symptoms :
 
- "latch: cache buffers chains" contention.
- "latch: object queue header operation" contention
- a RAC node can crash with and ora-481 / kjfcdrmrfg: SYNC TIMEOUT ... step 31
- a RAC node can crash with OERI[kjbldrmrpst:!master]

We therefore conclude that the root cause of this incident was that a DRM anomaly broke the core RAC processes, which ultimately led to the forced termination of the database instance (which, at bottom, protects data consistency). The customer has since disabled DRM; the system has been running for a week without any issue, and we will keep observing it.
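Disabling DRM, as the customer did, is typically done with hidden parameters; a hedged sketch for 10gR2 — hidden parameters should only be changed under Oracle Support's guidance, a restart of all instances is required, and on 11g the corresponding parameter is "_gc_policy_time":

```sql
-- Disable dynamic remastering cluster-wide (10gR2; takes effect after
-- the instances are restarted).
ALTER SYSTEM SET "_gc_affinity_time" = 0 SCOPE = SPFILE SID = '*';
-- Optionally also disable undo affinity (parameter name is
-- version-dependent; verify against your release before setting).
ALTER SYSTEM SET "_gc_undo_affinity" = FALSE SCOPE = SPFILE SID = '*';
```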
