我写关于12.2 sharding database的文章已经好久了,今天再次把环境启动了起来,启动了主机之后,依次启动了listener和shardcat数据库和shard node数据库实例。检查shard状态的时候,发现报错:
(a)问题一:status显示warnings:
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok Deployed region1 ONLINE
sh2 primary_shardgroup Warnings Deployed region1 -
GDSCTL>
遇到这样的情况,我们可以使用recover shard的命令。注意recover shard -h出来的帮助是错误的。
GDSCTL>recover shard -h
Syntax
RECOVER SHARD -gdspool pool -database db_name [-skip_first|-ignore_first]
[-full]
Purpose
Executes all DDL statements on specified database starting from the one, that
was previusly executed with errors. The command is intended to perform all
skipped DDL changes after database administrator fixes shard issues.
Usage Notes
Use SKIP_FIRST to skip first DDL. This is typically required after manual fix
done by database administrator. For example, if CREATE TABLE statement fails
because of lack of space, database administrator fixes the issue and re-executes
CREATE TABLE. To avoid ORA-39151 (table exists)in RECOVER SHARD he has to
specify -SKIP_FIRST.
Use IGNORE_FIRST to mark first DDL as obsolete. This is required when wrong DDL
statement was specified and failed on all shards. Thus you need to mark it down
as obsolete.
Keywords and Parameters
full: Full recovery mode.
gdspool: the GDS pool (If not specified and there is only one gdspool with access
granted to user, it will be used by default)
ignore_first: make first failed DDL statement obsolete.
shard: The name of the shard.
skip_first: skip first failed DDL statement
Examples
RECOVER SHARD -database shd1
GDSCTL>
正确的用法应该是recover shard -shard 。
我们运行这个命令,发现是gsm没有启动:
GDSCTL>recover shard -shard sh2
GSM-45076: GSM is not running
GDSCTL>
GDSCTL>start gsm
GSM is started successfully
GDSCTL>
启动后,稍等片刻(大约10秒),检查状态就恢复成正常了(在这里,我们还没有真正用到recover shard,在下面一个问题时才正常用到):
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok Deployed region1 ONLINE
sh2 primary_shardgroup Warnings Deployed region1 -
GDSCTL>
GDSCTL>--等待大约10秒
GDSCTL>
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok Deployed region1 ONLINE
sh2 primary_shardgroup Ok Deployed region1 ONLINE
GDSCTL>
所以,重新启动之后,需要启动的有database,listener,和gsm。
(b)问题二:state显示ddl error
然后,在接着的实验中,我drop掉shardcat上的一个duplicate table,发现此时shard状态又不正常了。config shard的state显示ddl error。
首先,在shardcat上操作,drop掉duplicate表
SQL> conn app_schema/oracle
Connected.
SQL>
SQL> drop table products;
drop table products
*
ERROR at line 1:
ORA-02557: cannot operate on sharded objects when shard DDL is disabled
SQL>
SQL>
SQL>
SQL> alter session enable shard ddl;
Session altered.
SQL> select count(*) from products;
COUNT(*)
----------
1000
SQL>
SQL> drop table products;
Table dropped.
SQL>
发现此时数据并没有同步,在shard node上,还是可以查询到数据:
SQL> !hostname
sdb2
SQL>
SQL> conn app_schema/oracle
Connected.
SQL>
SQL> select count(*) from products;
COUNT(*)
----------
1000
SQL>
SQL> !hostname
sdb3
SQL>
SQL> conn app_schema/oracle
Connected.
SQL>
SQL> select count(*) from products;
COUNT(*)
----------
1000
SQL>
此时检查shard状态,变成DDL error:
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok DDL error region1 ONLINE
sh2 primary_shardgroup Ok DDL error region1 ONLINE
GDSCTL>
我们可以在gsm的alertlog中看到一些提示信息:
(注:gsm的日志位置在/u01/ora12c/app/oracle/diag/gsm/sdb1/sharddirector1/trace/alert_shardcat.log)
……
07-NOV-2016 01:35:19 * (CONNECT_DATA=(SERVICE_NAME=GDS$CATALOG.oradbcloud)(CID=(PROGRAM=oracle)(HOST=sdb3)(USER=oracle12c))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.23)(PORT=34247)) * establish * GDS$CATALOG.oradbcloud * 0
2016-11-07T01:35:22.021482+08:00
07-NOV-2016 01:35:22 * service_update * shardcat%11 * 0
07-NOV-2016 01:35:22 * service_update * shardcat%1 * 0
2016-11-07T01:35:35.543868+08:00
GSM-40135: Catalog request: "DDL REQUEST" (80). Target -"33". Pool- - "". Id="72".
Payload:"A"
GSM-40135: Catalog request: "DDL REQUEST" (80). Target -"34". Pool- - "". Id="73".
Payload:"E"
2016-11-07T01:35:36.652686+08:00
GSM-40148: Database task failed:database:"sh1", status 4, message:"ORA-01031: insufficient privileges
ORA-06512: at "SYS.EXECASUSER", line 35
ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434
ORA-06512: at "SYS.EXECASUSER", line 23
ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68
ORA-06512: at line 1 (ngsmoci_execute)
"
GSM-40148: Database task failed:database:"sh2", status 4, message:"ORA-01031: insufficient privileges
ORA-06512: at "SYS.EXECASUSER", line 35
ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434
ORA-06512: at "SYS.EXECASUSER", line 23
ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68
ORA-06512: at line 1 (ngsmoci_execute)
"
2016-11-07T01:35:36.700593+08:00
GSM-40157: GSM response Id=72 Type=80. Status:"Success with info".
2016-11-07T01:35:37.010475+08:00
GSM-40135: Catalog request: "RUNTIME DATABASE" (40). Target -"sh2". Pool- - "". Id="74".
Payload:"(PARAMETERS=(DDLID=33)(AUTORCV=0))"
GSM-40135: Catalog request: "RUNTIME DATABASE" (40). Target -"sh1". Pool- - "". Id="75".
Payload:"(PARAMETERS=(DDLID=33)(AUTORCV=0))"
GSM-40157: GSM response Id=73 Type=80. Status:"Success".
2016-11-07T01:35:41.834561+08:00
07-NOV-2016 01:35:41 * service_update * SHARDDIRECTOR1 * 0
……
其实,从这个log中,我们只是看到有ORA-01031 insufficient privileges的报错,但究竟是什么权限不足,我们还是得通过config shard命令来进一步看:
GDSCTL>config shard -shard sh1
Name: sh1
Shard Group: primary_shardgroup
Status: Ok
State: Deployed
Region: region1
Connection string: sdb2:1521/sh1:dedicated
SCAN address:
ONS remote port: 0
Disk Threshold, ms: 20
CPU Threshold, %: 75
Version: 12.2.0.0
Last Failed DDL: alter database link "PRODUCTSD...
DDL Error: ORA-01031: insufficient privileges
ORA-06512: at "SYS.EXECASUSER", line 35
ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434
ORA-06512: at "SYS.EXECASUSER", line 23
ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68
ORA-06512: at line 1 (ngsmoci_execute)
Failed DDL id: 33
Availability: ONLINE
Supported services
------------------------
Name Preferred Status
---- --------- ------
oltp_rw_srvc Yes Enabled
GDSCTL>
我们看到这个last failed ddl,是alter database link的时候出现权限不足。
所以我们找到了原因,进行修复:
--在sh1上:
[oracle12c@sdb2 ~]$ sqlplus "/ as sysdba"
SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:50:25 2016
Copyright (c) 1982, 2015, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> grant alter database link to app_schema;
Grant succeeded.
SQL>
--在sh2上
[oracle12c@sdb3 ~]$ sqlplus "/ as sysdba"
SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:51:13 2016
Copyright (c) 1982, 2015, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> grant alter database link to app_schema;
Grant succeeded.
SQL>
--在shardcat上,gsm_env下:
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok DDL error region1 ONLINE
sh2 primary_shardgroup Ok DDL error region1 ONLINE
GDSCTL>
GDSCTL>recover shard -shard sh1
GDSCTL>
GDSCTL>recover shard -shard sh2
GDSCTL>
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok Deployed region1 ONLINE
sh2 primary_shardgroup Ok Deployed region1 ONLINE
GDSCTL>
--在sh1上:
[oracle12c@sdb2 ~]$ sqlplus "/ as sysdba"
SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:50:25 2016
Copyright (c) 1982, 2015, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> grant alter database link to app_schema;
Grant succeeded.
SQL>
--在sh2上
[oracle12c@sdb3 ~]$ sqlplus "/ as sysdba"
SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:51:13 2016
Copyright (c) 1982, 2015, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> grant alter database link to app_schema;
Grant succeeded.
SQL>
--在shardcat上,gsm_env下:
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok DDL error region1 ONLINE
sh2 primary_shardgroup Ok DDL error region1 ONLINE
GDSCTL>
GDSCTL>recover shard -shard sh1
GDSCTL>
GDSCTL>recover shard -shard sh2
GDSCTL>
GDSCTL>config shard
Name Shard Group Status State Region Availability
---- ----------- ------ ----- ------ ------------
sh1 primary_shardgroup Ok Deployed region1 ONLINE
sh2 primary_shardgroup Ok Deployed region1 ONLINE
GDSCTL>
可以看到,在修复问题的基础上,运行recover shard -shard 之后,就恢复了正常。
注意需要注意的是,如果root cause没找到,不知道是database link的权限问题,仅仅运行recover shard是没有效果的。