oracle中config shard的状态不正常的处理

作者:袖梨 2022-06-29

我写关于12.2 sharding database的文章已经好久了,今天再次把环境启动了起来,启动了主机之后,依次启动了listener和shardcat数据库和shard node数据库实例。检查shard状态的时候,发现报错:

(a)问题一:status显示warnings:


GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        Deployed    region1   ONLINE      
sh2                 primary_shardgroup  Warnings  Deployed    region1   -           
 
GDSCTL>

遇到这样的情况,我们可以使用recover shard的命令。注意recover shard -h出来的帮助是错误的。

GDSCTL>recover shard -h
Syntax
RECOVER SHARD -gdspool pool -database db_name [-skip_first|-ignore_first]      
[-full]                                                                        
 
Purpose
Executes all DDL statements on specified database starting from the one, that  
was previusly executed with errors. The command is intended to perform all     
skipped DDL changes after database administrator fixes shard issues.           
 
Usage Notes
Use SKIP_FIRST to skip first DDL. This is typically required after manual fix  
done by database administrator. For example, if CREATE TABLE statement fails   
because of lack of space, database administrator fixes the issue and re-executes
CREATE TABLE. To avoid ORA-39151 (table exists)in RECOVER SHARD he has to      
specify -SKIP_FIRST.                                                           
 
Use IGNORE_FIRST to mark first DDL as obsolete. This is required when wrong DDL
statement was specified and failed on all shards. Thus you need to mark it down
as obsolete.                                                                   
 
Keywords and Parameters
full:         Full recovery mode.
gdspool:      the GDS pool (If not specified and there is only one gdspool with access
              granted to user, it will be used by default)
ignore_first: make first failed DDL statement obsolete.
shard:        The name of the shard.
skip_first:   skip first failed DDL statement
 
Examples
RECOVER SHARD -database shd1
 
GDSCTL>
正确的用法应该是recover shard -shard 。

我们运行这个命令,发现是gsm没有启动:


GDSCTL>recover shard -shard sh2
GSM-45076: GSM is not running
GDSCTL>
GDSCTL>start gsm
GSM is started successfully
GDSCTL>

启动后,稍等片刻(大约10秒),检查状态就恢复成正常了(在这里,我们还没有真正用到recover shard,在下面一个问题时才正常用到):


GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        Deployed    region1   ONLINE      
sh2                 primary_shardgroup  Warnings  Deployed    region1   -           
 
GDSCTL>
GDSCTL>--等待大约10秒
GDSCTL>
GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        Deployed    region1   ONLINE      
sh2                 primary_shardgroup  Ok        Deployed    region1   ONLINE      
 
GDSCTL>
所以,重新启动之后,需要启动的有database,listener,和gsm。

(b)问题二:state显示ddl error
然后,在接着的实验中,我drop掉shardcat上的一个duplicate table,发现此时shard状态又不正常了。config shard的state显示ddl error。

首先,在shardcat上操作,drop掉duplicate表


SQL> conn app_schema/oracle
Connected.
SQL>
SQL> drop table products;
drop table products
*
ERROR at line 1:
ORA-02557: cannot operate on sharded objects when shard DDL is disabled
 
 
SQL>                               
SQL>
SQL>
SQL> alter session enable shard ddl;
 
Session altered.
 
SQL> select count(*) from products;
 
  COUNT(*)
----------
      1000
 
SQL>
SQL> drop table products;
 
Table dropped.
 
SQL>
发现此时数据并没有同步,在shard node上,还是可以查询到数据:


SQL> !hostname
sdb2
 
SQL>
SQL> conn app_schema/oracle
Connected.
SQL>
SQL> select count(*) from products;
 
  COUNT(*)
----------
      1000
 
SQL>
 
 
 
SQL> !hostname
sdb3
 
SQL>
SQL> conn app_schema/oracle
Connected.
SQL>
SQL> select count(*) from products;
 
  COUNT(*)
----------
      1000
 
SQL>
此时检查shard状态,变成DDL error:


GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        DDL error   region1   ONLINE      
sh2                 primary_shardgroup  Ok        DDL error   region1   ONLINE      
 
GDSCTL>
我们可以在gsm的alertlog中看到一些提示信息:
(注:gsm的日志位置在/u01/ora12c/app/oracle/diag/gsm/sdb1/sharddirector1/trace/alert_shardcat.log)

……
07-NOV-2016 01:35:19 * (CONNECT_DATA=(SERVICE_NAME=GDS$CATALOG.oradbcloud)(CID=(PROGRAM=oracle)(HOST=sdb3)(USER=oracle12c))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.23)(PORT=34247)) * establish * GDS$CATALOG.oradbcloud * 0
2016-11-07T01:35:22.021482+08:00
07-NOV-2016 01:35:22 * service_update * shardcat%11 * 0
07-NOV-2016 01:35:22 * service_update * shardcat%1 * 0
2016-11-07T01:35:35.543868+08:00
GSM-40135: Catalog request: "DDL REQUEST" (80). Target -"33". Pool- - "". Id="72".
Payload:"A"
GSM-40135: Catalog request: "DDL REQUEST" (80). Target -"34". Pool- - "". Id="73".
Payload:"E"
2016-11-07T01:35:36.652686+08:00
GSM-40148: Database task failed:database:"sh1", status 4, message:"ORA-01031: insufficient privileges
ORA-06512: at "SYS.EXECASUSER", line 35
ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434
ORA-06512: at "SYS.EXECASUSER", line 23
ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68
ORA-06512: at line 1 (ngsmoci_execute)
 
"
GSM-40148: Database task failed:database:"sh2", status 4, message:"ORA-01031: insufficient privileges
ORA-06512: at "SYS.EXECASUSER", line 35
ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434
ORA-06512: at "SYS.EXECASUSER", line 23
ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68
ORA-06512: at line 1 (ngsmoci_execute)
 
"
2016-11-07T01:35:36.700593+08:00
GSM-40157: GSM response Id=72 Type=80. Status:"Success with info".
2016-11-07T01:35:37.010475+08:00
GSM-40135: Catalog request: "RUNTIME DATABASE" (40). Target -"sh2". Pool- - "". Id="74".
Payload:"(PARAMETERS=(DDLID=33)(AUTORCV=0))"
GSM-40135: Catalog request: "RUNTIME DATABASE" (40). Target -"sh1". Pool- - "". Id="75".
Payload:"(PARAMETERS=(DDLID=33)(AUTORCV=0))"
GSM-40157: GSM response Id=73 Type=80. Status:"Success".
2016-11-07T01:35:41.834561+08:00
07-NOV-2016 01:35:41 * service_update * SHARDDIRECTOR1 * 0
……
其实,从这个log中,我们只是看到有ORA-01031 insufficient privileges的报错,但究竟是什么权限不足,我们还是得通过config shard命令来进一步看:


GDSCTL>config shard -shard sh1
Name: sh1
Shard Group: primary_shardgroup
Status: Ok
State: Deployed
Region: region1
Connection string: sdb2:1521/sh1:dedicated
SCAN address:
ONS remote port: 0
Disk Threshold, ms: 20
CPU Threshold, %: 75
Version: 12.2.0.0
Last Failed DDL: alter database link "PRODUCTSD...
DDL Error: ORA-01031: insufficient privileges
ORA-06512: at "SYS.EXECASUSER", line 35
ORA-06512: at "SYS.DBMS_SYS_SQL", line 1434
ORA-06512: at "SYS.EXECASUSER", line 23
ORA-06512: at "GSMADMIN_INTERNAL.EXECUTEDDL", line 68
ORA-06512: at line 1 (ngsmoci_execute)


Failed DDL id: 33
Availability: ONLINE


Supported services
------------------------
Name                                                            Preferred Status   
----                                                            --------- ------   
oltp_rw_srvc                                                    Yes       Enabled  

 

GDSCTL>

我们看到这个last failed ddl,是alter database link的时候出现权限不足。

所以我们找到了原因,进行修复:


--在sh1上:
[oracle12c@sdb2 ~]$ sqlplus "/ as sysdba"

SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:50:25 2016

Copyright (c) 1982, 2015, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> grant alter database link to app_schema;

Grant succeeded.

SQL>

--在sh2上
[oracle12c@sdb3 ~]$ sqlplus "/ as sysdba"

SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:51:13 2016

Copyright (c) 1982, 2015, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> grant alter database link to app_schema;

Grant succeeded.

SQL>

--在shardcat上,gsm_env下:
GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        DDL error   region1   ONLINE      
sh2                 primary_shardgroup  Ok        DDL error   region1   ONLINE      

GDSCTL>
GDSCTL>recover shard -shard sh1
GDSCTL>
GDSCTL>recover shard -shard sh2  
GDSCTL>
GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        Deployed    region1   ONLINE      
sh2                 primary_shardgroup  Ok        Deployed    region1   ONLINE      

GDSCTL>


--在sh1上:
[oracle12c@sdb2 ~]$ sqlplus "/ as sysdba"
 
SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:50:25 2016
 
Copyright (c) 1982, 2015, Oracle.  All rights reserved.
 
 
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
 
SQL> grant alter database link to app_schema;
 
Grant succeeded.
 
SQL>
 
--在sh2上
[oracle12c@sdb3 ~]$ sqlplus "/ as sysdba"
 
SQL*Plus: Release 12.2.0.0.2 Beta on Mon Nov 7 01:51:13 2016
 
Copyright (c) 1982, 2015, Oracle.  All rights reserved.
 
 
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.0.2 - 64bit Beta
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
 
SQL> grant alter database link to app_schema;
 
Grant succeeded.
 
SQL>
 
--在shardcat上,gsm_env下:
GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        DDL error   region1   ONLINE      
sh2                 primary_shardgroup  Ok        DDL error   region1   ONLINE      
 
GDSCTL>
GDSCTL>recover shard -shard sh1
GDSCTL>
GDSCTL>recover shard -shard sh2  
GDSCTL>
GDSCTL>config shard
Name                Shard Group         Status    State       Region    Availability
----                -----------         ------    -----       ------    ------------
sh1                 primary_shardgroup  Ok        Deployed    region1   ONLINE      
sh2                 primary_shardgroup  Ok        Deployed    region1   ONLINE      
 
GDSCTL>
可以看到,在修复问题的基础上,运行recover shard -shard 之后,就恢复了正常。

注意需要注意的是,如果root cause没找到,不知道是database link的权限问题,仅仅运行recover shard是没有效果的。

相关文章

精彩推荐