头条焦点:问题:Duplicate报错RMAN-03009, ORA-17628, ORA-19505

2023-05-15 06:15:36 来源: 博客园

前面文章提到,这周末帮一个客户测试报错场景:客户通过duplicate生产备库的方式创建cascade备库。发现每次都会遇到两个文件报错,ORA-17628: Oracle error 19505错误,且每一次跑,报错文件不一样。 现在想帮客户验证,这属于是正常现象还是bug。


(资料图片)

  • 一、问题具体分析
  • 二、解决方案验证

一、问题具体分析

最开始查看相关MOS文档,锁定的是这两篇文章:
  • ORA-17628, ORA-19505 during RMAN DUPLICATE FROM ACTIVE (Doc ID 1331986.1)

One data file is not using OMF name while the rest of the data files are using OMF name.怀疑是对应数据文件名字的问题,但后来排除了此原因。因为比对报错数据文件和正常的数据文件并无特殊之处,且每次报错文件并不一样。

另外一篇文档是:

  • Known issues on ORA-17628: Oracle error 19505 returned by remote Oracle server Reported in Rman Active Duplicate (Doc ID 1401333.1)
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 03/28/2011 19:17:43ORA-17628: Oracle error 19505 returned by remote Oracle server

这篇文档比较全,把已知问题枚举了下,可作为参考。得到的启示是,不但要看duplicate的日志,还要去看auxiliary数据库的alert日志,有可能会有更详细的日志描述。

另外我重新审视,增加关键字,还匹配到更精确的一篇文档:

  • Duplicate from Active Database Failing with: RMAN-03009, ORA-17628, and ORA-19505 (Doc ID 1439632.1)

结论是说目标路径有重复文件,这...其实上篇文档也有提到。另外,1401333.1也说了,

  • ORA-17628 / ORA-19505 is a very generic error and we need more debugging to get the exact cause .

这个错误其实是很通用的一个错误号,如果已知问题都不能匹配,那么就需要更多的诊断来找到确切的原因。也就是又强调了下,我们还要去看比如auxiliary数据库的alert日志等详细信息。

二、解决方案验证

下面按客户要求,想测试当只有部分几个文件报错,是否可以单独备份过去?这其实算workaround的范畴,我们通常不建议这样来解决一个实际问题。但客户实在是不想反复重跑这个duplicate,因为库太大了,老这么重跑时间上确实有些伤不起...所以本着客户关怀角度,也就协助测试下这个可能性:因为这种问题确实不好直接回答是或否,只能构造场景实际测试下:

首先从一级备库duplicate到二级备库会报错:

RMAN-00571: ===========================================================RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============RMAN-00571: ===========================================================RMAN-03002: failure of Duplicate Db command at 05/14/2023 23:54:15RMAN-05501: aborting duplication of target databaseRMAN-05531: a mounted database cannot be duplicated while datafiles are fuzzy

先改为从主库duplicate到二级备库,重新跑脚本。

RMAN-00571: ===========================================================RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============RMAN-00571: ===========================================================RMAN-03002: failure of Duplicate Db command at 05/14/2023 23:56:57RMAN-05501: aborting duplication of target databaseRMAN-03015: error occurred in stored script Memory ScriptRMAN-03009: failure of backup command on prmy1 channel at 05/14/2023 23:56:57ORA-17628: Oracle error 19505 returned by remote Oracle server

看看!这里有同样的错误号,但是并不是复现问题,因为这个错误号有太多的可能性。此时正好按上面文档习得的经验,去查看二级备库(auxiliary)的alert日志有没有更详细的信息:

Sun May 14 23:56:56 2023Errors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_ora_6028.trc:ORA-19505: failed to identify file "/u01/oradata/jingyu/control01.ctl"ORA-27040: file create error, unable to create fileLinux-x86_64 Error: 2: No such file or directoryAdditional information: 1Sun May 14 23:57:41 2023destination database instance is "started" not "mounted"

果然,是因为一个简单的文件创建失败的错误,应该是对应的文件夹不存在,直接创建即可:

[oracle@db11gcas ~]$ cd /u01/oradata/jingyu-bash: cd: /u01/oradata/jingyu: No such file or directory[oracle@db11gcas ~]$ cd /u01/oradata[oracle@db11gcas oradata]$ ls -lddrwxr-xr-x. 2 oracle oinstall 6 May 14 23:49 .[oracle@db11gcas oradata]$ mkdir jingyu[oracle@db11gcas oradata]$ cd /u01/oradata/jingyu[oracle@db11gcas jingyu]$ 

创建对应的文件夹后再次跑脚本成功。

但是为了构造客户的问题,我这里将成功的文件故意删除掉1个。

...channel prmy1: starting datafile copyinput datafile file number=00004 name=/u01/oradata/jingyu/users01.dbfoutput file name=/u01/oradata/JYCAS/datafile/o1_mf_dbs_i_ji_1r1s5uap_.dbf tag=TAG20230515T000118channel prmy2: datafile copy complete, elapsed time: 00:00:01output file name=/u01/oradata/JYCAS/datafile/o1_mf_t2_1s1s5uap_.dbf tag=TAG20230515T000118channel prmy3: datafile copy complete, elapsed time: 00:00:01output file name=/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf tag=TAG20230515T000118channel prmy1: datafile copy complete, elapsed time: 00:00:01Finished backup at 2023-05-15 00:01:31...

就删除上面这个4号数据文件吧!删除之后肯定数据库是无法open的。

/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf[oracle@db11gcas datafile]$ rm o1_mf_users_1t1s5uaq_.dbfSQL> alter database open;alter database open*ERROR at line 1:ORA-10458: standby database requires recoveryORA-01157: cannot identify/lock data file 4 - see DBWR trace fileORA-01110: data file 4: "/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf"

而且,这个时候开启mrp进程,也是不成功的:

SQL> recover managed standby database disconnect;Media recovery complete.SQL> !ps -ef|grep mrporacle    7079  6943  0 00:15 pts/1    00:00:00 /bin/bash -c ps -ef|grep mrporacle    7081  7079  0 00:15 pts/1    00:00:00 grep mrp

没有成功启动mrp进程,去看alert日志,清楚的告诉你是因为4号文件找不到:

Mon May 15 00:15:04 2023ALTER DATABASE RECOVER  managed standby database disconnect  Attempt to start background Managed Standby Recovery process (jingyu)Mon May 15 00:15:04 2023MRP0 started with pid=26, OS id=7056 MRP0: Background Managed Standby Recovery process started (jingyu) started logmerger processMon May 15 00:15:09 2023Managed Standby Recovery not using Real Time ApplyMon May 15 00:15:09 2023Errors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_dbw0_5520.trc:ORA-01157: cannot identify/lock data file 4 - see DBWR trace fileORA-01110: data file 4: "/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf"ORA-27037: unable to obtain file statusLinux-x86_64 Error: 2: No such file or directoryAdditional information: 3MRP0: Background Media Recovery terminated with error 1110Errors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_pr00_7062.trc:ORA-01110: data file 4: "/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf"ORA-01157: cannot identify/lock data file 4 - see DBWR trace fileORA-01110: data file 4: "/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf"Slave exiting with ORA-1110 exceptionErrors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_pr00_7062.trc:ORA-01110: data file 4: "/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf"ORA-01157: cannot identify/lock data file 4 - see DBWR trace fileORA-01110: data file 4: "/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf"Completed: ALTER DATABASE RECOVER  managed standby database disconnect  Recovery Slave PR00 previously exited with exception 1110MRP0: Background Media Recovery process shutdown (jingyu)

此时尝试去一级备库备份这个4号数据文件,然后传输到二级备库:

RMAN> backup datafile 4 format "/tmp/dbfile4.bak"; Starting backup at 2023-05-15 00:14:27using target database control file instead of recovery catalogallocated channel: ORA_DISK_1channel ORA_DISK_1: SID=12 device type=DISKchannel ORA_DISK_1: starting full datafile backup setchannel ORA_DISK_1: specifying datafile(s) in backup setinput datafile file number=00004 name=/u01/oradata/JYADG/datafile/o1_mf_users_0r1s27bm_.dbfchannel ORA_DISK_1: starting piece 1 at 2023-05-15 00:14:27channel ORA_DISK_1: finished piece 1 at 2023-05-15 00:14:28piece handle=/tmp/dbfile4.bak tag=TAG20230515T001427 comment=NONEchannel ORA_DISK_1: backup set complete, elapsed time: 00:00:01Finished backup at 2023-05-15 00:14:28[oracle@db11gadg ~]$ scp /tmp/dbfile4.bak 192.168.1.20:/tmporacle@192.168.1.20"s password: dbfile4.bak                                                                                                                                     100% 1344KB  80.2MB/s   00:00    [oracle@db11gadg ~]$

然后在二级备库恢复这个缺失的4号数据文件:

RMAN> list backup of datafile 4;using target database control file instead of recovery catalogspecification does not match any backup in the repositoryRMAN> catalog start with "/tmp/dbfile4.bak";searching for all files that match the pattern /tmp/dbfile4.bakList of Files Unknown to the Database=====================================File Name: /tmp/dbfile4.bakDo you really want to catalog the above files (enter YES or NO)? yescataloging files...cataloging doneList of Cataloged Files=======================File Name: /tmp/dbfile4.bakRMAN> restore datafile 4;Starting restore at 2023-05-15 00:22:12using channel ORA_DISK_1channel ORA_DISK_1: starting datafile backup set restorechannel ORA_DISK_1: specifying datafile(s) to restore from backup setchannel ORA_DISK_1: restoring datafile 00004 to /u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbfchannel ORA_DISK_1: reading from backup piece /tmp/dbfile4.bakchannel ORA_DISK_1: piece handle=/tmp/dbfile4.bak tag=TAG20230515T001427channel ORA_DISK_1: restored backup piece 1channel ORA_DISK_1: restore complete, elapsed time: 00:00:01Finished restore at 2023-05-15 00:22:14RMAN> 

此时查看alert日志,也是正常恢复:

Mon May 15 00:22:13 2023Full restore complete of datafile 4 /u01/oradata/JYCAS/datafile/o1_mf_users_l622onyr_.dbf.  Elapsed time: 0:00:00   checkpoint is 5941162  last deallocation scn is 3

需要注意下因为使用OMF,恢复的4号数据文件跟之前的名字不一样了!但是不影响。现在再尝试启动mrp进程恢复就OK了。

SQL> select file#, CHECKPOINT_CHANGE#, STATUS from v$datafile;     FILE# CHECKPOINT_CHANGE# STATUS---------- ------------------ -------         1            5941162 SYSTEM         2            5941162 ONLINE         3            5941162 ONLINE         4            5941162 ONLINE         5            5941162 ONLINE         6            5941162 ONLINE         7            5541098 ONLINE         8            5541108 ONLINE8 rows selected.SQL> recover managed standby database disconnect;Media recovery complete.SQL> !ps -ef|grep mrporacle    7687     1  0 00:26 ?        00:00:00 ora_mrp0_jingyuoracle    7713  7606  0 00:27 pts/1    00:00:00 /bin/bash -c ps -ef|grep mrporacle    7715  7713  0 00:27 pts/1    00:00:00 grep mrpSQL> select file#, CHECKPOINT_CHANGE#, STATUS from v$datafile;     FILE# CHECKPOINT_CHANGE# STATUS---------- ------------------ -------         1            5942030 SYSTEM         2            5942030 ONLINE         3            5942030 ONLINE         4            5942030 ONLINE         5            5942030 ONLINE         6            5942030 ONLINE         7            5541098 ONLINE         8            5541108 ONLINE8 rows selected.SQL> @dgNAME                   VALUE                  UNIT                           TIME_COMPUTED                  DATUM_TIME---------------------- ---------------------- ------------------------------ ------------------------------ ------------------------------transport lag          +00 00:17:22           day(2) to second(0) interval   05/15/2023 00:28:48            05/15/2023 00:27:53apply lag                                     day(2) to second(0) interval   05/15/2023 00:28:48apply finish time                             day(2) to second(3) interval   05/15/2023 00:28:48estimated startup time 9                      second                         05/15/2023 00:28:48

此时主库切换一下日志,再次查询级联备库状态:

SQL> @dgNAME                   VALUE                  UNIT                           TIME_COMPUTED                  DATUM_TIME---------------------- ---------------------- ------------------------------ ------------------------------ ------------------------------transport lag          +00 00:00:00           day(2) to second(0) interval   05/15/2023 00:29:14            05/15/2023 00:29:05apply lag              +00 00:00:00           day(2) to second(0) interval   05/15/2023 00:29:14            05/15/2023 00:29:05apply finish time                             day(2) to second(3) interval   05/15/2023 00:29:14estimated startup time 9                      second                         05/15/2023 00:29:14SQL> select file#, CHECKPOINT_CHANGE#, STATUS from v$datafile;     FILE# CHECKPOINT_CHANGE# STATUS---------- ------------------ -------         1            5943460 SYSTEM         2            5943460 ONLINE         3            5943460 ONLINE         4            5943460 ONLINE         5            5943460 ONLINE         6            5943460 ONLINE         7            5541098 ONLINE         8            5541108 ONLINE8 rows selected.SQL> @dgNAME                   VALUE                  UNIT                           TIME_COMPUTED                  DATUM_TIME---------------------- ---------------------- ------------------------------ ------------------------------ ------------------------------transport lag          +00 00:00:48           day(2) to second(0) interval   05/15/2023 00:29:56            05/15/2023 00:29:54apply lag              +00 00:00:48           day(2) to second(0) interval   05/15/2023 00:29:56            05/15/2023 00:29:54apply finish time                             day(2) to second(3) interval   05/15/2023 00:29:56estimated startup time 9                      second                         05/15/2023 00:29:56

说明这个恢复缺失数据文件再开启mrp进程的方案可行。但出于从根本解决问题的初衷,我们还是会建议找到具体原因重新跑的方式来恢复。

标签:

[责任编辑:]

最近更新