这是友情支持一个朋友的数据库恢复case,昨天圣诞节的时候来个求助,只能速度帮忙解决了好过节去了。首先我们来看下的alert log都有哪些信息:
Wed Dec 23 15:19:21 2015
SMON: enabling tx recovery
Wed Dec 23 15:19:21 2015
Database Characterset is ZHS16GBK
Opening with internal Resource Manager plan
where NUMA PG = 2, CPUs = 8
Wed Dec 23 15:19:22 2015
Errors in file d:oracleproduct10.2.0adminajzqudumpajzq_ora_1712.trc:
ORA-00600: internal error code, arguments: [4194], [45], [45], [], [], [], [], []
Doing block recovery for file 2 block 1618943
Block recovery from logseq 4, block 63 to scn 7662329350
Wed Dec 23 15:19:23 2015
Recovery of Online Redo Log: Thread 1 Group 1 Seq 4 Reading mem 0
Mem# 0: D:ORACLEPRODUCT10.2.0ORADATAxxxxREDO01.LOG
Block recovery completed at rba 4.4749.16, scn 1.3367362055
Doing block recovery for file 2 block 121
Block recovery from logseq 4, block 63 to scn 7662329346
Wed Dec 23 15:19:23 2015
Recovery of Online Redo Log: Thread 1 Group 1 Seq 4 Reading mem 0
Mem# 0: D:ORACLEPRODUCT10.2.0ORADATAxxxxREDO01.LOG
Block recovery completed at rba 4.3977.16, scn 1.3367362051
Wed Dec 23 15:19:24 2015
SMON: Restarting fast_start parallel rollback
Wed Dec 23 15:19:24 2015
Errors in file d:oracleproduct10.2.0adminxxxxudumpxxxx_ora_1712.trc:
ORA-00600: internal error code, arguments: [4194], [45], [45], [], [], [], [], []
Wed Dec 23 15:19:24 2015
DEBUG: Replaying xcb 0xf0adfbe0, pmd 0x20588f48 for failed op 8
Doing block recovery for file 2 block 1618943
Block recovery from logseq 4, block 63 to scn 7662329350
Wed Dec 23 15:19:24 2015
Recovery of Online Redo Log: Thread 1 Group 1 Seq 4 Reading mem 0
Mem# 0: D:ORACLEPRODUCT10.2.0ORADATAxxxxREDO01.LOG
Block recovery completed at rba 4.4749.16, scn 1.3367362055
Wed Dec 23 15:21:28 2015
Errors in file d:oracleproduct10.2.0adminxxxxudumpxxxx_ora_1712.trc:
ORA-00600: internal error code, arguments: [4194], [45], [45], [], [], [], [], []
ORA-00600: internal error code, arguments: [4194], [45], [45], [],
.....
Wed Dec 23 16:47:26 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_p005_8980.trc:
ORA-00600: internal error code, arguments: [2037], [141460651], [141460651], [162], [6], [1], [3367403517], [3141222167]
Wed Dec 23 16:47:26 2015
Hex dump of (file 92, block 3926982) in trace file d:oracleproduct10.2.0adminxxxxbdumpxxxx_dbw0_8864.trc
Corrupt block relative dba: 0x173bebc6 (file 92, block 3926982)
Bad header found during preparing block for write
Data in bad block:
type: 23 format: 7 rdba: 0xbd3b3f17
last change scn: 0x0001.c8b6be51 seq: 0x5 flg: 0x00
spare1: 0x3b spare2: 0xbc spare3: 0xc
consistency value in tail: 0xbe511705
check value in block header: 0x0
block checksum disabled
Wed Dec 23 16:47:26 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_p005_8980.trc:
ORA-00600: internal error code, arguments: [2037], [141460651], [141460651], [162], [6], [1], [3367403517], [3141222167]
Wed Dec 23 16:47:27 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_dbw0_8864.trc:
ORA-00600: internal error code, arguments: [kcbzpbuf_1], [4], [1], [], [], [], [], []
Wed Dec 23 16:47:27 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_dbw0_8864.trc:
ORA-00600: internal error code, arguments: [kcbzpbuf_1], [4], [1], [], [], [], [], []
DBW0: terminating instance due to error 471
我们可以看到,这个数据库已经宕机好几天了,在23号的时候就已经无法正常open了。上述的结果错误也很常见很简单。朋友经过处理后顺利把数据库open了,可是打开之后创建新的undo 表空间之后尝试重启后就再也打开不开了。遇到了如下的错误:
Fri Dec 25 11:44:17 2015
Recovery of Online Redo Log: Thread 1 Group 3 Seq 6 Reading mem 0
Mem# 0: D:ORACLEPRODUCT10.2.0ORADATAxxxxREDO03.LOG
Fri Dec 25 11:44:17 2015
Completed redo application
Fri Dec 25 11:44:17 2015
Hex dump of (file 44, block 518322) in trace file d:oracleproduct10.2.0adminxxxxbdumpxxxx_dbw2_21232.trc
Corrupt block relative dba: 0x0b07e8b2 (file 44, block 518322)
Bad header found during preparing block for write
Data in bad block:
type: 82 format: 2 rdba: 0x1c4ee895
last change scn: 0x0001.c8bbad54 seq: 0x21 flg: 0xf0
spare1: 0x52 spare2: 0x4d spare3: 0x895b
consistency value in tail: 0xad545221
check value in block header: 0x0
block checksum disabled
Fri Dec 25 11:44:18 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_dbw2_21232.trc:
ORA-00600: internal error code, arguments: [kcbzpbuf_1], [4], [1], [], [], [], [], []
Fri Dec 25 11:44:18 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_dbw2_21232.trc:
ORA-00600: internal error code, arguments: [kcbzpbuf_1], [4], [1], [], [], [], [], []
DBW2: terminating instance due to error 471
Fri Dec 25 11:44:18 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_p014_20704.trc:
ORA-00471: DBWR process terminated with error
这个错误我也是第一次遇见。从kcbzpbuf函数来看,是dbwr进出写脏块时发现了坏块。很明显,从上面的坏块信息来看,dba地址0x0b07e8b2 和 rdba地址0x1c4ee895是不匹配,这肯定没法正常写入的。
针对这个错误我事后在metalink 搜索了一下,还是有不少相关的文档甚至是bug,这里不多说了。处理方式很简单,首先加入如下2个隐含参数尝试打开数据库:
*._allow_resetlogs_corruption=TRUE
*._allow_error_simulation=TRUE
做了不完全恢复后,尝试open resetlogs时发现遇到了如下错误:
ri Dec 25 12:20:03 2015
SMON: enabling cache recovery
Fri Dec 25 12:20:04 2015
Errors in file d:oracleproduct10.2.0adminxxxxudumpxxxx_ora_20176.trc:
ORA-00600: internal error code, arguments: [2662], [1], [3367742663], [1], [3367742798], [4194313], [], []
Fri Dec 25 12:20:05 2015
Errors in file d:oracleproduct10.2.0adminxxxxudumpxxxx_ora_20176.trc:
ORA-00600: internal error code, arguments: [2662], [1], [3367742663], [1], [3367742798], [4194313], [], []
Fri Dec 25 12:20:05 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Fri Dec 25 12:20:05 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_p015_21472.trc:
ORA-00600: internal error code, arguments: [15784], [600], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], []
Fri Dec 25 12:20:05 2015
Errors in file d:oracleproduct10.2.0adminxxxxbdumpxxxx_p014_9072.trc:
ORA-00600: internal error code, arguments: [15784], [600], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], []
看到这里2个错误,很多人可能会很迷惑,到底数据库无法open是因为哪个错误导致的呢? 这里稍微注意一下前后报错的顺序就知道了,很明显是前面的ora-00600 [2662]错误导致无法open resetlogs完成。
对于ora-00600 [2662]错误这里不再解释了,SCN的问题。这里直接推进scn即可。
将数据库启动到mount状态,执行如下命令:
SQL> alter system set job_queue_processes=0;
Session altered.
SQL> startup mount
ORACLE instance started.
Total System Global Area 7516192768 bytes
Fixed Size 2175728 bytes
Variable Size 1232850192 bytes
Database Buffers 6274678784 bytes
Redo Buffers 6488064 bytes
Database mounted.
SQL> alter session set events 'IMMEDIATE trace name adjust_scn level 10';
Session altered.
SQL> alter database open;
Database altered.
很顺利就打开了数据库。由于之前朋友已经创建了新的undo表空间,我只需要帮忙把undo重建就完事了,结束支持任务。尝试进行drop tablespace undotbs1 including contents and datafiles时,遇到ora-01548错误。
在alter session set “_smu_debug_mode” = 4; 后都无法进行drop,那只能继续使用下隐含参数了:
_offline_rollback_segments
_corrupted_rollback_segments
通过这2个隐含参数,可以顺利将旧的undo 表空间drop掉,然后修改pfile文件去掉不需要的一些隐含参数,顺利打开数据库。
到这里结束我的半小时支持友情支持任务!悲剧的下午又开始优化某客户的SQL。。。。。
备注:
1) 对于这种异常强制打开的数据库,建议进行全库检查,包括是否有坏块等,甚至检查数据字典是否一致等等(实际上检查之前的alert log发现, 之前数据库在恢复过程中就报了不少的坏块)
2) 在这个年代,还有3TB的数据库,而且是windows,更悲剧的还是非归档,没有备份,这是不应该的。
3) 我们应该多考虑下系统的容灾建设,哪怕是有备份。