Sunday, November 01, 2009

RMAN Retention Policy with Corrupted Backups

I always assumed that RMAN was smart enough to take care of my obsolete database backups. I give it the retention policy and it's done: whenever I invoke the DELETE OBSOLETE command, RMAN identifies the backups that fall outside my retention policy and safely deletes them. Nevertheless, there is at least one big exception: when the backup just taken is corrupted.

The following is quite self-explanatory. Let's assume we have a retention policy of redundancy 1 and we take a new backup of the database.

RMAN> backup database;

Starting backup at 01-11-2009 11:20:53
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting compressed full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set

...
channel ORA_DISK_1: backup set complete, elapsed time: 00:01:26
Finished backup at 01-11-2009 11:22:20


Now we have two backups and, according to the configured retention policy, the previous one becomes obsolete. However, let's suppose that the backup we have just taken is corrupted. We can simulate this using dd (we're zeroing 1MB somewhere in the middle of the backup piece):

dd if=/dev/zero of=o1_mf_nnndf_TAG20091101T232053_5gvyxpwt_.bkp bs=1M seek=10 count=1 conv=notrunc
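
A side note on dd: by default it truncates the output file right after the last block it writes, so zeroing a range in the middle of a file needs conv=notrunc. Here is a scaled-down sketch on a scratch file (the file name and the KB sizes are illustrative and are not taken from the session above):

```shell
# Scaled-down demo on a scratch file; no real backup piece is touched.
piece=$(mktemp)

# Build a 20 KB stand-in "backup piece" filled with non-zero bytes.
dd if=/dev/zero bs=1024 count=20 2>/dev/null | tr '\0' 'x' > "$piece"

# Zero 1 KB at offset 10 KB. conv=notrunc keeps everything after the hole;
# without it dd would cut the file off at 11 KB.
dd if=/dev/zero of="$piece" bs=1024 seek=10 count=1 conv=notrunc 2>/dev/null

# Check the result: total size and number of zeroed bytes.
size_after=$(wc -c < "$piece" | tr -d ' ')
zeroed=$(tr -d 'x' < "$piece" | wc -c | tr -d ' ')
echo "size=$size_after zeroed=$zeroed"
rm -f "$piece"
```

The file keeps its 20 KB size and exactly 1 KB in the middle is zeroed, which is the kind of damage we want to simulate.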


Okay! It's good practice to validate the backup using RESTORE VALIDATE DATABASE, so let's do it:

RMAN> restore validate database;

Starting restore at 01-11-2009 11:30:10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=42 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=37 device type=DISK

channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_2: starting validation of datafile backup set

...

ORA-19599: block number 1280 is corrupt in backup piece
/opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T232053_5gvyxpwt_.bkp

channel ORA_DISK_2: piece handle=/opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T232053_5gvyxp3o_.bkp
tag=TAG20091101T232053
channel ORA_DISK_2: restored backup piece 1
channel ORA_DISK_2: validation complete, elapsed time: 00:00:35
failover to previous backup

...
Finished restore at 01-11-2009 11:31:13


As you can see, RESTORE VALIDATE worked as expected: it identified the corrupted backup set and failed over to the previous valid one. However, what if there's a "delete noprompt obsolete" command at the end of the backup script?

RMAN> delete noprompt obsolete;

RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
using channel ORA_DISK_1
using channel ORA_DISK_2
Deleting the following obsolete backups and copies:
Type Key Completion Time Filename/Handle
-------------------- ------ ------------------ --------------------
Archive Log 2 01-11-2009 10:40:27 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/archivelog/2009_11_01/o1_mf_1_6_5gvwkv55_.arc
Backup Set 10 01-11-2009 11:19:57
Backup Piece 10 01-11-2009 11:19:57 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqdc_.bkp
Backup Set 9 01-11-2009 11:19:53
Backup Piece 9 01-11-2009 11:19:53 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/backupset/2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqtm_.bkp
Backup Set 11 01-11-2009 11:20:04
Backup Piece 11 01-11-2009 11:20:04 /opt/oracle/app/oracle/flash_recovery_area
/VENUSDB/autobackup/2009_11_01/o1_mf_s_701824802_5gvyw3h1_.bkp
deleted archived log
archived log file name=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/archivelog/
2009_11_01/o1_mf_1_6_5gvwkv55_.arc RECID=2 STAMP=701822427
deleted backup piece
backup piece handle=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/backupset/
2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqdc_.bkp RECID=10 STAMP=701824695
deleted backup piece
backup piece handle=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/backupset/
2009_11_01/o1_mf_nnndf_TAG20091101T231814_5gvyrqtm_.bkp RECID=9 STAMP=701824695
deleted backup piece
backup piece handle=/opt/oracle/app/oracle/flash_recovery_area/VENUSDB/autobackup/
2009_11_01/o1_mf_s_701824802_5gvyw3h1_.bkp RECID=11 STAMP=701824803
Deleted 4 objects


Oops! It just deleted our only valid backup set. The proof:

RMAN> restore validate database;

Starting restore at 01-11-2009 11:35:03
using channel ORA_DISK_1
using channel ORA_DISK_2

channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_2: starting validation of datafile backup set

...

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 11/01/2009 23:35:40
RMAN-06026: some targets not found - aborting restore
RMAN-06023: no backup or copy of datafile 5 found to restore
RMAN-06023: no backup or copy of datafile 3 found to restore
RMAN-06023: no backup or copy of datafile 2 found to restore


I don't know if the above behavior is clearly mentioned in the Oracle backup and recovery documentation, but it should be taken into account when defining the backup and recovery strategy. Of course, a retention policy of redundancy 1 is not a setting to be used on production systems but, anyway, I expect trouble even if the retention policy is set to a higher redundancy. In my opinion, it would be great if RMAN could somehow label the corrupted backups when RESTORE VALIDATE is invoked and then take this into account when the retention policy is applied.

Meanwhile, to avoid the above scenario in your backup scripts, it's advisable to group RESTORE VALIDATE and DELETE NOPROMPT OBSOLETE within a RUN { ... } block. If the first command fails, the DELETE command is never executed.
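
A minimal sketch of such a script, assuming OS authentication ("target /") and that rman is on the PATH; the temporary command file is just an illustration. Keep in mind that in the session above RESTORE VALIDATE failed over to the older backup and finished, so check in your own environment that it really raises an error in the situation you want to guard against.

```shell
# Write the RMAN commands to a temporary command file (name is illustrative).
cmdfile=$(mktemp)
cat > "$cmdfile" <<'EOF'
run {
  restore validate database;
  delete noprompt obsolete;
}
EOF

# Run it only where RMAN is installed; otherwise just show the script.
if command -v rman >/dev/null 2>&1; then
  rman target / cmdfile="$cmdfile"
else
  cat "$cmdfile"
fi
```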

Tuesday, April 28, 2009

CREATE VIEW with FORCE does not work

Yesterday I loaded an Oracle dump into our 10.2.0.4 database... and guess what? Not all the views were created. I took a look at the impdp log and saw some errors complaining about ORA-00980: synonym translation is no longer valid. So what? The CREATE VIEW statements were issued with the FORCE clause, so the views should have been created, right?

Well, after some digging on Metalink I found the answer. It basically says that there is a(nother) bug and, according to the description, CREATE FORCE VIEW using a synonym for a table fails to create the view if the synonym is invalid. The 10.2.0.3 and 10.2.0.4 databases are confirmed to be affected, and this bug is supposed to be fixed in 10.2.0.5 and 11.2.

In my case, the solution was to fix the synonym problem and then reimport just the views using the INCLUDE parameter of the impdp utility.
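
A views-only import could look roughly like the sketch below. The directory object, dump file and log file names are hypothetical placeholders, and so is the connecting user; only the INCLUDE=VIEW filter is the point here.

```shell
# Data Pump parameter file; directory object, dump file and log names are hypothetical.
parfile=$(mktemp)
cat > "$parfile" <<'EOF'
directory=DATA_PUMP_DIR
dumpfile=mydump.dmp
logfile=views_only.log
include=VIEW
EOF

# impdp prompts for the password; run only where the utility is installed.
if command -v impdp >/dev/null 2>&1; then
  impdp system parfile="$parfile"
else
  cat "$parfile"
fi
```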

Thursday, February 12, 2009

WTF is that? (ep. 2)

Today, the next episode of the Oracle WTF stories. One of my colleagues brought to my attention the fact that the DECODE function doesn't work as expected when used with dates. He had a very simple test case:

create table muci (my_date date);

insert into muci
  select decode(to_date('30/12/2099', 'dd/mm/yyyy'),
                sysdate,
                null,
                to_date('30/12/2099', 'dd/mm/yyyy'))
    from dual;

He asked me: what will we have in the MUCI table after running the statements above? I didn't think too much. I realized that SYSDATE is not likely to be 30/12/2099, even though the possibility of a wrong OS clock setting couldn't be excluded, so I simply said that the final result should be 30/12/2099.

Let's take a look:

SQL> select to_char(my_date, 'dd/mm/yyyy') from muci;

TO_CHAR(MY_DATE,'DD/MM/YYYY')
-----------------------------
30/12/1999


Well, this was unexpected... WTF? What's wrong with the year? Even with a wrong OS clock setting this shouldn't happen. The reason must be somewhere else. Because I remembered that the result of DECODE depends on the type of its arguments, I said: let's look into the docs! Yep, the answer was there: "if the first result is null, then Oracle converts the return value to the datatype VARCHAR2". How does this apply to our test case? It's simple: the whole result of the DECODE is a VARCHAR2 and not a DATE as one might think. The VARCHAR2 representation of a plain date value depends on NLS_DATE_FORMAT, which on our server was:

SQL> select value from nls_session_parameters
where parameter='NLS_DATE_FORMAT';

VALUE
----------------------------------------
DD-MON-RR

So, when the INSERT was done, the value produced by DECODE was a string in the DD-MON-RR format (something like '30-DEC-99'), which was then implicitly converted back to a DATE using the same NLS_DATE_FORMAT. The RR element maps the two-digit year 99 to 1999, so we ended up with a “wrong” year in the final result. Lovely!
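
To see why the two-digit year 99 came back as 1999, here is a small sketch of the RR rule as the Oracle documentation describes it, written as a shell function. The function name rr_year is my own invention for illustration; it only models the century choice and is not how Oracle implements it.

```shell
# rr_year YY CURRENT_YEAR -> four-digit year chosen by the RR format element.
# Pass YY without a leading zero so the shell does not treat it as octal.
rr_year() {
  yy=$1
  century=$(( $2 / 100 ))
  cur_yy=$(( $2 % 100 ))
  if [ "$yy" -lt 50 ] && [ "$cur_yy" -ge 50 ]; then
    century=$(( century + 1 ))   # e.g. 5 seen in 1998 -> 2005
  elif [ "$yy" -ge 50 ] && [ "$cur_yy" -lt 50 ]; then
    century=$(( century - 1 ))   # e.g. 99 seen in 2009 -> 1999
  fi
  echo $(( century * 100 + yy ))
}

rr_year 99 2009   # the case from the test above
```

A common way to sidestep the conversion altogether is to give the first result an explicit DATE type, for example TO_DATE(NULL) instead of a bare NULL, so that DECODE returns a DATE.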

Tuesday, October 07, 2008

Remotely Connect to a RESTRICT Opened Database

Lately, I have this subconscious mantra which basically says: don’t believe everything the official Oracle docs say; try to prove those facts yourself! For example, one thing to try is starting an instance in restricted mode and then checking what Oracle says in the Administrator's Guide 11g, "Starting Up a Database" chapter:

when the instance is in restricted mode, a database administrator cannot access the instance remotely through an Oracle Net listener, but can only access the instance locally from the machine that the instance is running on.

Let's try! On the server:

SQL> startup restrict
ORACLE instance started.

Total System Global Area 835104768 bytes
Fixed Size 2149000 bytes
Variable Size 595592568 bytes
Database Buffers 230686720 bytes
Redo Buffers 6676480 bytes
Database mounted.
Database opened.


On the client, using an admin user:

Enter user-name: admin@tbag
Enter password:
ERROR:
ORA-12526: TNS:listener: all appropriate instances are in restricted mode


What they forget to say here is that this behavior occurs only with dynamic listener registration. If I explicitly specify the SID_LIST in my listener.ora file, then I can connect remotely without problems.
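
For reference, a static registration entry in listener.ora looks roughly like the fragment below. The listener name, SID, global database name and ORACLE_HOME path are assumptions for illustration and must be adapted to your installation.

```
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = tbag)
      (ORACLE_HOME = /opt/oracle/app/oracle/product/11.1.0/db_1)
      (SID_NAME = tbag)
    )
  )
```

After editing the file, the listener has to pick up the change, for example with "lsnrctl reload".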

Thursday, June 12, 2008

Oracle Linux Date

If you ever need to get the current Linux time from Oracle, you might be interested in the following solution. First of all, the Linux (Unix) epoch time is expressed as the number of seconds since 1970-01-01 00:00:00 UTC and can be obtained with the date +'%s' command. For example:


oracle@oxg:~$ date +'%s'
1213261534

From Oracle you can use the following custom function:


create or replace function current_linux_date return integer is
  -- current UTC time and the epoch reference point
  l_crr_date timestamp(9) := SYS_EXTRACT_UTC(systimestamp);
  l_ref_date timestamp(9) := to_date('01011970', 'ddmmyyyy');
  -- elapsed time since the epoch, computed once
  l_elapsed  interval day(9) to second(9);
  l_seconds  integer;
begin
  l_elapsed := l_crr_date - l_ref_date;
  -- trunc() drops the fractional seconds, like date +'%s' does,
  -- instead of letting the integer assignment round them
  l_seconds := extract(day from l_elapsed) * 24 * 3600 +
               extract(hour from l_elapsed) * 3600 +
               extract(minute from l_elapsed) * 60 +
               trunc(extract(second from l_elapsed));
  return l_seconds;
end current_linux_date;
/
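
The arithmetic inside the function can be checked from the shell as well. The sketch below takes the epoch value from the date +'%s' example above, splits it into days, hours, minutes and seconds (the same parts the function extracts from the interval), and puts them back together; the variable names are mine.

```shell
# Epoch value taken from the date +'%s' example above.
epoch=1213261534

# Split it into the same parts the PL/SQL function extracts from the interval.
days=$(( epoch / 86400 ))
hours=$(( epoch % 86400 / 3600 ))
minutes=$(( epoch % 3600 / 60 ))
seconds=$(( epoch % 60 ))
echo "days=$days hours=$hours minutes=$minutes seconds=$seconds"

# Recombine them exactly as the function does.
rebuilt=$(( days * 24 * 3600 + hours * 3600 + minutes * 60 + seconds ))
echo "rebuilt=$rebuilt"
```

The 14042 days correspond to 2008-06-12, and the rebuilt value matches the original epoch.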

Now, you should get the same result from Oracle:


SQL> select current_linux_date from dual;

CURRENT_LINUX_DATE
------------------
1213261993

oracle@oxg:~$ date +'%s'
1213261993

Have fun!

Monday, May 12, 2008

Profiling the new SIMPLE_INTEGER type

Oracle 11g comes with a new PL/SQL type called SIMPLE_INTEGER. The official documentation says that this type yields a significant performance improvement over the PLS_INTEGER type. Because I want to see this with my own eyes, I’ve decided to test it using another new 11g component, the hierarchical profiler, which I also want to see in action.

First of all, let’s setup the environment:

1. on the database server create a new directory to be used for creating profiler trace files:

oracle@obi:oracle$ mkdir profiler
oracle@obi:oracle$ chmod o-rx profiler/


2. create the DIRECTORY object in the database too, and grant read/write privileges to the testing user (in our case TALEK user):

SQL> create directory profiler_dir as '/opt/oracle/profiler';

Directory created.

SQL> grant read, write on directory profiler_dir to talek;

Grant succeeded.


3. grant execute privilege for DBMS_HPROF package to the TALEK user:

SQL> grant execute on dbms_hprof to talek;

Grant succeeded.


4. connect using TALEK user and create the following package (the only difference between the first and second approach is the type of the l_count variable):

create or replace package trash is

procedure approach_1;

procedure approach_2;

end trash;
/

create or replace package body trash is

  -- counter declared as PLS_INTEGER
  procedure approach_1 as
    l_count pls_integer := 0;
  begin
    for i in 1..10000 loop
      l_count := l_count + 1;
    end loop;
    dbms_output.put_line(l_count);
  end;

  -- same loop, counter declared as SIMPLE_INTEGER
  procedure approach_2 as
    l_count simple_integer := 0;
  begin
    for i in 1..10000 loop
      l_count := l_count + 1;
    end loop;
    dbms_output.put_line(l_count);
  end;

end trash;
/


5. Profile the approaches:

SQL> exec dbms_hprof.start_profiling(location => 'PROFILER_DIR', filename => 'test.trc');

PL/SQL procedure successfully completed

SQL> exec trash.approach_1;

PL/SQL procedure successfully completed

SQL> exec trash.approach_2;

PL/SQL procedure successfully completed

SQL> exec dbms_hprof.stop_profiling;

PL/SQL procedure successfully completed


6. Analyze the generated trace file. For this we’ll use the "plshprof" command line utility.

oracle@obi:profiler$ plshprof -output report test.trc
PLSHPROF: Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
[8 symbols processed]
[Report written to 'report.html']


Aaaaand, the WINNER is:

TALEK.TRASH.APPROACH_1 -> 5713 (microseconds)
TALEK.TRASH.APPROACH_2 -> 100706 (microseconds)


Well… this is unexpected. According to Oracle docs, the SIMPLE_INTEGER should be faster. Ok, back to official doc: "The new PL/SQL SIMPLE_INTEGER data type is a binary integer for use with native compilation which is neither null checked nor overflow checked". Ahaaa… native compilation! Let’s check this:

SQL> show parameter plsql_code_type

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
plsql_code_type string INTERPRETED


So, we have a first clue and a first conclusion. If the database doesn’t use NATIVE compilation, the SIMPLE_INTEGER type is actually much slower (100706 vs. 5713 microseconds in this run).

Let’s switch to native compilation. This can be easily done because the "plsql_code_type" parameter is dynamic:

SQL> alter system set plsql_code_type=native scope=both;

System altered.


It is important to recompile the package, because otherwise the old interpreted PL/SQL byte code will still be used (you can use "alter package trash compile plsql_code_type=native;"). Then repeat the profiler tests.

The new results are:

TALEK.TRASH.APPROACH_2 -> 3927 (microseconds)
TALEK.TRASH.APPROACH_1 -> 12556 (microseconds)


Now the second approach, with SIMPLE_INTEGER, is much faster (3927 vs. 12556 microseconds). Interestingly, the PLS_INTEGER approach takes about twice as long under native compilation as it did in the initial interpreted environment (12556 vs. 5713 microseconds).
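
To put the four timings side by side, here is a quick ratio calculation with awk. The numbers are copied from the two profiler reports above; since each comes from a single run, treat the ratios as indicative only.

```shell
# Timings in microseconds, copied from the two profiler runs above.
ratios=$(awk 'BEGIN {
  interp_pls = 5713;  interp_simple = 100706
  native_pls = 12556; native_simple = 3927
  printf "interpreted: SIMPLE_INTEGER is %.1fx slower than PLS_INTEGER\n", interp_simple / interp_pls
  printf "native: SIMPLE_INTEGER is %.1fx faster than PLS_INTEGER\n", native_pls / native_simple
  printf "PLS_INTEGER: native is %.1fx slower than interpreted\n", native_pls / interp_pls
}')
echo "$ratios"
```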

Okay, one more thing. I really enjoy using the new 11g hierarchical profiler. From my point of view, it is a big step forward compared with the old DBMS_PROFILER, and the HTML reports produced by "plshprof" are quite lovely.

Sunday, May 11, 2008

WTF is that? (ep. 1)

I've just decided to post here some (...well, you know) Oracle WTF stories: those moments (... hours, and sometimes days) when, sitting at my desk, I'm just staring at that stupid SQLPLUS> prompt, unable to figure out what the hell is happening.

Today, episode 1:

The scenario is very simple. I have two nice users: TALEK and SIM. TALEK has a table and gives UPDATE rights to SIM.

SQL> connect talek
Enter password:
Connected.

SQL> create table muci (col1 varchar2(10));

Table created.

SQL> insert into muci values ('abc')
2 /

1 row created.

SQL> commit;

Commit complete.

SQL> grant update on muci to sim;

Grant succeeded.

SQL> connect sim
Enter password:
Connected.

SQL> update talek.muci set col1='xyz' where col1='abc';
update talek.muci set col1='xyz' where col1='abc'
*
ERROR at line 1:
ORA-01031: insufficient privileges


Of course, this is the WTF moment. Why did the UPDATE fail? The first thing to do is to check the DBA_TAB_PRIVS view to confirm that the UPDATE privilege is still there. (I'm pretty sure nobody was fast enough to revoke the granted privilege in the meantime but, just in case...)

SQL> select grantee, owner, table_name, privilege 
from dba_tab_privs where table_name='MUCI' and owner='TALEK';

GRANT OWNER TABLE_NAM PRIVILEGE
----- ----- --------- ----------
SIM TALEK MUCI UPDATE


And yes, the privilege is there. Hmmm... what's next? Usually the next thought is that another Oracle bug makes fun of me. But, this sounds too scary to be true. Finally, the stupid answer comes to light.

SQL> show parameter sql92_security

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
sql92_security boolean TRUE


The database reference documentation says the following:
"The SQL92 standards specify that security administrators should be able to require that users have SELECT privilege on a table when executing an UPDATE or DELETE statement that references table column values in a WHERE or SET clause. SQL92_SECURITY specifies whether users must have been granted the SELECT object privilege in order to execute such UPDATE or DELETE statements."

With sql92_security set to TRUE, it is actually the "where col1='abc'" filter of the UPDATE statement that triggers the "insufficient privileges" error, not the UPDATE itself: reading column values in the WHERE clause requires the SELECT privilege, which SIM doesn't have. Without a filter the update executes as expected:

SQL> update talek.muci set col1='xyz';

1 row updated.
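
If SIM really needs the filtered update while sql92_security stays TRUE, the straightforward fix is to grant SELECT as well. A sketch, assuming SQL*Plus is on the PATH; the script file name is illustrative and the final rollback is only there so the sketch leaves the data untouched.

```shell
# SQL*Plus script granting the missing SELECT privilege and retrying the update.
sqlfile="${TMPDIR:-/tmp}/grant_select_$$.sql"
cat > "$sqlfile" <<'EOF'
connect talek
grant select on muci to sim;
connect sim
update talek.muci set col1='xyz' where col1='abc';
rollback;
exit
EOF

# Each "connect" prompts for a password; run only where SQL*Plus is installed.
if command -v sqlplus >/dev/null 2>&1; then
  sqlplus /nolog @"$sqlfile"
else
  cat "$sqlfile"
fi
```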


Ok, another lesson has been learned!