This blog has moved here.

Sunday, November 01, 2009

RMAN Retention Policy with Corrupted Backups

I always assumed that RMAN is smart enough to take care of my database obsolete backups. I give it the retention policy and it's done: whenever I invoke the DELETE OBSOLETE command rman will identify those backups out of the scope of my retention policy and will safely delete them. Nevertheless, there is at least one big exception: when the taken backup is corrupted.

The following is quite self explanatory. Lets assume we have a retention policy of redundancy 1 and we take a new backup of the database.

RMAN> backup database;

Starting backup at 01-11-2009 11:20:53
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting compressed full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set

...
channel ORA_DISK_1: backup set complete, elapsed time: 00:01:26
Finished backup at 01-11-2009 11:22:20


Now, we have two backups and, according to the configured retention policy, the previous one becomes obsolete. However, let's suppose that the backup we just taken is corrupted. We can simulate this using dd (we're zeroing 1MB somewhere in between):

dd if=/dev/zero of=o1_mf_nnndf_TAG20091101T232053_5gvyxpwt_.bkp bs=1M seek=10 count=1


Okey! As a good practice it's nice to validate the backup using the "RESTORE VALIDATE BACKUP" so let's do it:

RMAN> restore validate database;

Starting restore at 01-11-2009 11:30:10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=42 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=37 device type=DISK

channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_2: starting validation of datafile backup set

...

ORA-19599: block number 1280 is corrupt in backup piece
/opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T232053_5gvyxpwt_.bkp

channel ORA_DISK_2: piece handle=/opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T232053_5gvyxp3o_.bkp
tag=TAG20091101T232053
channel ORA_DISK_2: restored backup piece 1
channel ORA_DISK_2: validation complete, elapsed time: 00:00:35
failover to previous backup

...
Finished restore at 01-11-2009 11:31:13


As you can see the BACKUP VALIDATE worked as expected. It identified the corrupted backupset and failed over to the previous valid one. However, what if at the end of the backup script there's a "delete noprompt obsolete" command?

RMAN> delete noprompt obsolete;

RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
using channel ORA_DISK_1
using channel ORA_DISK_2
Deleting the following obsolete backups and copies:
Type Key Completion Time Filename/Handle
-------------------- ------ ------------------ --------------------
Archive Log 2 01-11-2009 10:40:27 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/archivelog/2009_11_01/o1_mf_1_6_5gvwkv55_.arc
Backup Set 10 01-11-2009 11:19:57
Backup Piece 10 01-11-2009 11:19:57 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqdc_.bkp
Backup Set 9 01-11-2009 11:19:53
Backup Piece 9 01-11-2009 11:19:53 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqtm_.bkp
Backup Set 11 01-11-2009 11:20:04
Backup Piece 11 01-11-2009 11:20:04 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/autobackup/2009_11_01/o1_mf_s_701824802_5gvyw3h1_.bkp
deleted archived log
archived log file name=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/archivelog/
2009_11_01/o1_mf_1_6_5gvwkv55_.arc RECID=2 STAMP=701822427
deleted backup piece
backup piece handle=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/backupset/
2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqdc_.bkp RECID=10 STAMP=701824695
deleted backup piece
backup piece handle=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/backupset/
2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqtm_.bkp RECID=9 STAMP=701824695
deleted backup piece
backup piece handle=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/autobackup/
2009_11_01/o1_mf_s_701824802_5gvyw3h1_.bkp RECID=11 STAMP=701824803
Deleted 4 objects


Uuups! It just deleted our valid backupset. The proof:

RMAN> restore validate database;

Starting restore at 01-11-2009 11:35:03
using channel ORA_DISK_1
using channel ORA_DISK_2

channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_2: starting validation of datafile backup set

...

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 11/01/2009 23:35:40
RMAN-06026: some targets not found - aborting restore
RMAN-06023: no backup or copy of datafile 5 found to restore
RMAN-06023: no backup or copy of datafile 3 found to restore
RMAN-06023: no backup or copy of datafile 2 found to restore


I don't know if the above behavior is clearly mentioned in the Oracle backup and recovery documentation but this should be taken into account when defining the backup and recovery strategy. Of course a RETENTION POLICY of 1 is not a setting to be used in productive systems but, anyway, I expect troubles even if the retention policy is set to a higher redundancy. In my option, it would be great if RMAN could label somehow the corrupted backups at the time the restore validate is invoked and then to take into account this when the retention policy is applied.

Meanwhile, in order to avoid the above scenario within your backup scripts, it's advisable to group the RESTORE VALIDATE and DELETE NOPROMPT OBSOLETE within a RUN { ... } command. If the first command fails then the DELETE command will never be executed.

No comments: