Ora-00494 / Aiowait timeout cause instance crashed.(SYSTEM DUMP分析案例)

作者:袖梨 2022-06-29

报错如下:

Wed Oct 21 03:06:22 2015
Thread 1 advanced to log sequence 254206 (LGWR switch)
  Current log# 9 seq# 254206 mem# 0: /boss_sysdata/oradata/dtvboss/redo09.log
Wed Oct 21 03:17:10 2015
WARNING: aiowait timed out 1 times
Wed Oct 21 03:22:11 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367'
Wed Oct 21 03:22:17 2015
System State dumped to trace file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
Wed Oct 21 03:22:50 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367'
Wed Oct 21 03:22:55 2015
Killing enqueue blocker (pid=367) on resource CF-00000000-00000000
 by killing session 1648.1
Wed Oct 21 03:22:55 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-07445: exception encountered: core dump [ksuklms()+672] [SIGSEGV] [Address not mapped to object] [0x000000062] [] []
Wed Oct 21 03:23:18 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc:
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:18 2015
PMON: terminating instance due to error 470
Wed Oct 21 03:23:18 2015
ORA-470 encountered when generating server alert SMG-3503
Wed Oct 21 03:23:18 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_j006_22335.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:22 2015
Termination issued to instance processes. Waiting for the processes to exit
Wed Oct 21 03:23:28 2015
Instance termination failed to kill one or more processes
Wed Oct 21 03:23:58 2015
Instance terminated by PMON, pid = 343
Wed Oct 21 03:24:06 2015
Shutting down instance (abort)
Wed Oct 21 03:25:18 2015
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 21 03:25:18 2015
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 21 03:25:18 2015


1.宕机的起因分析:

告警日志分析:
Wed Oct 21 03:06:22 2015
Thread 1 advanced to log sequence 254206 (LGWR switch)
Current log# 9 seq# 254206 mem# 0: /boss_sysdata/oradata/dtvboss/redo09.log
Wed Oct 21 03:17:10 2015
WARNING: aiowait timed out 1 times –该时间段系统出现aio等待的告警,该告警由_iowait_timeouts隐含参数控制,超过该参数设置的值后会导致系统宕机,这里并未超过系统设置值。但是提示我们当时系统在io调度上出现了问题。
Wed Oct 21 03:22:11 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by ‘inst 1, osid 367′ –22分17秒系统抛错,进程367持有controlfile的enqueue lock时间被其他进程(arc0)等待超时,该enqueue lock持有时间由隐含参数_controlfile_enqueue_timeout控制,默认为900秒
Wed Oct 21 03:22:17 2015
System State dumped to trace file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
Wed Oct 21 03:22:50 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by ‘inst 1, osid 367′ –22分50秒,进程367持有的controlfile的enqueue lock时间被其他进程(lgwr)等待超时抛错
Wed Oct 21 03:22:55 2015
Killing enqueue blocker (pid=367) on resource CF-00000000-00000000 –LGWR进程发现异常查杀持有controlfile equeue lock的进程367
by killing session 1648.1
Wed Oct 21 03:22:55 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-07445: exception encountered: core dump [ksuklms()+672] [SIGSEGV] [Address not mapped to object] [0x000000062] [] []
Wed Oct 21 03:23:18 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc: –后续一系列报错宕机开始
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:18 2015
PMON: terminating instance due to error 470
Wed Oct 21 03:23:18 2015
ORA-470 encountered when generating server alert SMG-3503
Wed Oct 21 03:23:18 2015
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_j006_22335.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:22 2015
Termination issued to instance processes. Waiting for the processes to exit
Wed Oct 21 03:23:28 2015
Instance termination failed to kill one or more processes

Trace文件分析(trace过长摘取关键部分):
*** SERVICE NAME:(SYS$BACKGROUND) 2015-10-21 03:22:11.264
*** SESSION ID:(1646.3) 2015-10-21 03:22:11.264
*** 2015-10-21 03:22:11.264
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=367 sid=1648 sser=1 time_held=37 secs (ges mode req=6 held=4)
DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER
possible owner[17.367] on resource CF-00000000-00000000
Dumping process info of pid[17.367] requested by pid[23.404]
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by ‘inst 1, osid 367′

——————————————————————————-
ENQUEUE [CF] HELD FOR TOO LONG

enqueue holder: ‘inst 1, osid 367′

Process ‘inst 1, osid 367′ is holding an enqueue for maximum allowed time.
The process will be terminated.

Oracle Support Services triaging information: to find the root-cause, look
at the call stack of process ‘inst 1, osid 367′ located below. Ask the
developer that owns the first NON-service layer in the stack to investigate.
Common service layers are enqueues (ksq), latches (ksl), library cache
pins and locks (kgl), and row cache locks (kqr).

Dumping process 17.367 info:
*** 2015-10-21 03:22:11.300
Dumping diagnostic information for ospid 367:
OS pid = 367
loadavg : 0.44 0.62 1.82
swap info: free_mem = 1435.35M rsv = 59461.24M
alloc = 55941.14M avail = 55124.91 swap_free = 58645.01M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 367 1 0 39 20 ? 4116160 ? Mar 30 ? 498:14 ora_ckpt_dtvboss — 整个系统当时的空闲内存和swap还留有余地,足够当时系统的使用,所以内存不足的原因可以排除。这里提示进程367为oracle ckpt进程,可以确认当时持有controlfile enqueue lock的进程为oracle ckpt进程

PROCESS 23: –该进程为arc0 归档进程,trace信息显示在等待controlfile enqueue 队列锁。等待事件为enq: CF – contention
—————————————-
SO: 8a9004f88, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=23, calls cur/top: 8a95c6650/8a95c6650, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: b66002018 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: b66005770 1 2
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 404
OSD pid info: Unix process pid: 404, image: oracle@hzboss-ora1 (ARC0)
Dump of memory from 0x00000005F033B1F0 to 0x00000005F033B3F8
5F033B1F0 0000000C 00000000 00000008 AF739C70 [………….s.p]
5F033B200 00000006 000313A7 00000008 A95D1620 [………….]. ]
5F033B210 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B220 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B230 00000006 000313A7 00000008 A95D1FB8 [………….]..]
5F033B240 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B250 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B260 00000006 000313A7 00000008 A95D1FB8 [………….]..]
5F033B270 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B280 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B290 00000006 000313A7 00000008 A95D1FB8 [………….]..]
5F033B2A0 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B2B0 00000006 000313A7 00000008 A95D17E8 [………….]..]
5F033B2C0 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B2D0 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B2E0 00000006 000313A7 00000008 A95D1880 [………….]..]
5F033B2F0 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B300 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B310 00000006 000313A7 00000008 A95D16B8 [………….]..]
5F033B320 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B330 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B340 00000006 000313A7 00000008 A95D1880 [………….]..]
5F033B350 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B360 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B370 00000006 000313A7 00000008 A95D16B8 [………….]..]
5F033B380 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B390 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B3A0 00000006 000313A7 00000008 A95D17E8 [………….]..]
5F033B3B0 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B3C0 00000006 000313A7 00000008 AF739C70 [………….s.p]
5F033B3D0 00000006 000313A7 00000008 A95D17E8 [………….]..]
5F033B3E0 00000005 000313A7 00000008 AF739C70 [………….s.p]
5F033B3F0 00000006 000313A7 [……..]
(FOB) flags=2 fib=8af713780 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control03.ctl
fno=2 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af7133e0 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control02.ctl
fno=1 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af713040 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control01.ctl
fno=0 lblksz=16384 fsiz=1308
—————————————-
SO: 8a9778500, type: 11, owner: 8a9004f88, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a9004f88,
event: 21, last message event: 21,
last message waited event: 21, messages read: 0
channel: (5eb52cc20) scumnt mount lock
scope: 1, event: 22, last mesage event: 20,
publishers/subscribers: 0/20,
messages published: 1
—————————————-
SO: 8a95b7328, type: 4, owner: 8a9004f88, flag: INIT/-/-/0x00
(session) sid: 1646 trans: 0, creator: 8a9004f88, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0017-0000000D, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
last wait for ‘enq: CF – contention’ blocking sess=0x0 seq=43322 wait_time=2929723 seconds since wait started=36
name|mode=43460004, 0=0, operation=0
Dumping Session Wait History
for ‘enq: CF – contention’ count=1 wait_time=2929723
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929723
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929724
name|mode=43460004, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929750
name|mode=43460004, 0=0, operation=0

SO: 8a95d1fb8, type: 5, owner: 8a95c6650, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0017-0000000D
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, lock_flag: 0x0
own: 0x8a95b7328, sess: 0x8a95b7328, prv: 0x8a95d1fe8 –这里说明arc0在申请0x5eb40f608的队列锁资源。而在process 17号的ckpt中可以看到0x5eb40f608资源被ckpt持有

process 17(ckpt进程)中的句柄:

SO: 8a95b9df8, type: 4, owner: 8a90037d0, flag: INIT/-/-/0x00
(session) sid: 1648 trans: 0, creator: 8a90037d0, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0011-00000005, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for ‘direct path write’ blocking sess=0x0 seq=39369 wait_time=0 seconds since wait started=37
file number=72, first dba=1, block cnt=1
Dumping Session Wait History
for ‘direct path write’ count=1 wait_time=0
file number=73, first dba=1, block cnt=1
for ‘direct path write’ count=1 wait_time=0
file number=74, first dba=1, block cnt=1
for ‘direct path write’ count=1 wait_time=0
file number=75, first dba=1, block cnt=1
for ‘direct path write’ count=1 wait_time=0
file number=76, first dba=1, block cnt=1
for ‘direct path write’ count=1 wait_time=309
file number=77, first dba=1, block cnt=1
for ‘direct path read’ count=1 wait_time=1
file number=8b, first dba=1, block cnt=1
for ‘direct path read’ count=1 wait_time=0
file number=8a, first dba=1, block cnt=1
for ‘direct path read’ count=1 wait_time=0
file number=89, first dba=1, block cnt=1
for ‘direct path read’ count=1 wait_time=1
file number=88, first dba=1, block cnt=1
for ‘direct path read’ count=1 wait_time=1
file number=87, first dba=1, block cnt=1
temporary object counter: 0
—————————————-
SO: 8a95d1cc0, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0011-00000005
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, mode: SSX, lock_flag: 0x0 —————-ckpt以ssx的方式持有了该资源
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb40f618

PROCESS 10: LGWR进程
—————————————-
SO: b66002018, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=10, calls cur/top: b665c6de0/b665c6de0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: 8a9000860 1 6
last post sent: 109 0 4
last post sent-location: kslpsr
last process posted by me: 8a902ef90 105 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 365
OSD pid info: Unix process pid: 365, image: oracle@hzboss-ora1 (LGWR)

SO: b665bf398, type: 4, owner: b66002018, flag: INIT/-/-/0x00
(session) sid: 824 trans: 0, creator: b66002018, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-000A-00000006, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for ‘enq: CF – contention’ blocking sess=0x8a95b9df8 seq=61266 wait_time=0 seconds since wait started=0 0x8a95b9df8该句柄持有者为process 17的ckpt进程
name|mode=43460005, 0=0, operation=0
Dumping Session Wait History
for ‘enq: CF – contention’ count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929716
name|mode=43460005, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929716
name|mode=43460005, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929718
name|mode=43460005, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for ‘enq: CF – contention’ count=1 wait_time=2929719

—————————————-
SO: b665c6de0, type: 3, owner: b66002018, flag: INIT/-/-/0x00
(call) sess: cur b665bf398, rec 0, usr b665bf398; depth: 0
—————————————-
SO: b665cce90, type: 5, owner: b665c6de0, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-000A-00000006
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, req: SSX, lock_flag: 0x0 ————前面已经说明了,该5eb40f608的资源队列锁已经被ckpt进程持有,而lgwr正在申请以ssx的方式持有。
own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f628

综上可以得出结论,由于arc0进程要写日志需要等待控制文件CF锁,lgwr估计要写日志或者切换也在等待控制文件CF锁,而ckpt进程持有controlfile enqueue lock时间过长,arch0等待超过900秒,导致oracle数据库系统判断异常。
此时oracle数据库系统会通知lgwr进程kill持有controlfile enqueue lock的进程,相当于lgwr进程杀了ckpt进程,而ckpt进程又为oracle关键进程,连锁反应,引起宕机。

相关文章

精彩推荐