Oracle ASM in Azure with premium (SSD) disks - warning



So after following my previous notes on installing ASM into Azure, i repeated everything but using premium (SSD) disks and all appeared to be well - until that is i did a data pump load into it and put some stress on the system - then (and this was consistently every time) this happened....

ORA-00354: corrupt redo log block header 
ORA-00353: log corruption near block 196020 change 378858 time 02/23/2017 23:54:17 
ORA-00312: online log 2 thread 1: '+FRA/AZAPA/ONLINELOG/group_2.257.936744263' 
ORA-00312: online log 2 thread 1: '+DATA/AZAPA/ONLINELOG/group_2.259.936744261' 



Nice eh - I don't think i've ever seen that before in 20 years of doing this.....

So what is going on - a Redhat bug? An Azure bug? An Oracle bug even?

Well i started off by raising a ticket with all three to see where it went and was worried that this was going to turn into a finger pointing exercise.

Well surprisingly (after a slow start) Oracle came up with the solution and some interesting information at the same time.

First up was the interesting info



Quite a surprise that one!

So then i set about provisioning a new OEL server just to make sure i was on a fully 'certified/supported' platform - whatever the subtle difference in that is I'm not sure.

After re-running the test case the same corruption happened so seems its not a redhat issue.

So now on to the next bit of information from oracle - which surprising was hard to find without the direct links that were posted in the SR.

The problem was actually caused by the way data is written to SSD's - it doesn't use the same block size as traditional magnetic disks (512 bytes) - everything it does is 4096 bytes. For most Oracle files this isn't an issue as they are writing more than 512 bytes anyway (datafiles,controlfiles etc) - however for one very important process (LGWR) this is an issue and is the thing causing the corruption of the redo logs i was seeing.

OK - so now we know the cause what's the fix?

Well couple of things as it seems that ASM is not auto detecting the fact that these SSD's want to work with 4K - the first step is to make sure the devices are discovered with the right sector size (4K) - to do this we run the create disk group command with an extra parameter - however by default it won't accept it - see below


SQL>  create diskgroup data external redundancy disk '/dev/oracleasm-fra1' attribute 'sector_size'='4096';
 create diskgroup data external redundancy disk '/dev/oracleasm-fra1' attribute 'sector_size'='4096'
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created

ORA-15038: disk '/dev/oracleasm-fra1' mismatch on 'Sector Size' with target disk group [512] [4096]

So - it's still refusing to use 4K - not sure if this is just a feature of the disks in Azure or if this is a general problem with SSD - anyway the way to force ASM to work in 4K is like this:

SQL> alter system set "_disk_sector_size_override"=TRUE scope=spfile
  2  /

System altered.

So now it's using the correct 4K size - so the ASM disk groups are now fine.

However there is one more step we have to be mindful of when adding the redo logs - and here it is:

SQL> create database character set utf8 logfile group 1 size 1G blocksize 4096,group 2 size 1G blocksize 4096;

Database created.

Now i did that in the create database statement (which is probably wise unless you are switching to SSD later on) - the key bit is the optional block size 4096 clause at the end (default 512) - now the redo logs will be written in 4K rather than 512 bytes.

So after those changes i re-ran the import and am pleased to say all looks good - no sign of the corruption.

Now i just want to do some performance testing of the various asm/non-asm options to see which gives us the best performance. I would hope ASM should be faster than any other option


5 comments:

  1. Hi,
    If you use oracle asmlib, you can configure this for a diskgroup by using oracleasm configure -I.
    It will ask you if you want to use logical size for your DG (check /sys/block//queue/logical_block_size and /sys/block//queue/physical_block_size)

    HTH

    ReplyDelete
  2. Thanks Laurent - that will be useful for others I'm sure. I personally don't use ASMLIB - I couldn't see the point of it when it first came out - it just seemed to add another layer into things that didn't really add anything.

    That was quite a while ago though - maybe I should revisit it and see what the current implementation looks like.

    Cheers,
    Rich

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. ASMFD (ASM Filter Driver) Support on OS Platforms (Certification Matrix). (Doc ID 2034681.1)

    Still an extra layer though everything points towards more complexity.. it will be ever harder to withstand

    Cheers,
    Balazs

    ReplyDelete
  5. Hmm interesting - still not really enough to sell it to me for our usage - however future enhancements mentions thin provisioning - that may then be very relevant.....

    ReplyDelete