Characterset the email revenge



Now if you've read a few of my blog posts over the past few years you'll know i have an ongoing hate/hate relationship with anything to do with charactersets.

Things had been reasonable quiet recently until today where i got the all too familiar call about 'why does my data look funny?'

This time however was a little different to normal and actually it is maybe not really Oracle's fault although we use oracle to fix it.......

So what was happening this time i hear you ask - we'll let me tell you....

We have some clever little plsql code that enables us to email out to end users with a csv attachment that they can open directly in excel and work with - this is a pure plsql solution and is pretty neat (not my creation i hasten to add) - you can find similar example on the internet by searching for plsql and csv.

This had been working happily for many years in one of our systems and we had a requirement to re-use that code in another system - so we duly installed it all ran tests and found that all worked as expected......well it did until we added some non english characters into the mix.

The stange thing here was though the original database was already sending non english characters and they worked fine.

So whats going on?

The text in question is this:

 SÅGAB-Sågverken Norrland Ek. För
 
Which looks like either Norwegian or Swedish - i ran the plsql from both systems. The 'original' system worked fine and the data opened fine, the other was corrupting the data - you can see what i mean from the pictures below:

Original:


New database:


 
 
So what's going on - whats the difference here?
 
Well again it's charactersets - the original working db is WE8MSWIN1252 - the new one is UTF8.
 
Before i go and throw my laptop out the window lets go and have a closer look at this file its generated - the first thing i did was open the file in notepad rather than excel to see if that told me anything - and in notepad the text looked fine......strange indeed....
 
Then i tried importing the csv into excel rather than opening natively - this also worked fine.....
 
So it seems the issue is only when excel opens a csv directly that has come from the plsql code......
 
OK - now i have something to go on so i search the net and discover there is a bug/feature in excel that prevents it in some cases identifying a file as unicode - in these cases it tries the default characterset and ends up showing corrupt data - this is exactly what was happening in my case.
 
So what's the fix?
 
Well some posts suggested Excel 2013 fixes this - but that doesn't help me. Another comment i found is that there is a missing Byte Order Mark (BOM) at the beginning of the CSV that excel wants to be there to work properly - so we should try and add that to fix is.
 
It took me a while to get my head round how the plsql was constructed but in the end here are the few extra lines i added before the main csv data is populated
 
/* lots of other plsql before this */
 
/* this is the block to set the BOM correctly that fixes the issue */
        common.pg_mail.pr_write_raw(
            p_conn    => conn,
            p_message => utl_encode.base64_encode(HEXTORAW('EFBBBF'))
        );
 
/* this is the main csv data outputter */
    while length(v_text) > 0 loop
        data := utl_raw.cast_to_raw(substr(v_text, 1, common.pg_mail.MAX_BASE64_LINE_WIDTH));
        v_text := substr(v_text, common.pg_mail.MAX_BASE64_LINE_WIDTH + 1);
        common.pg_mail.pr_write_raw(
            p_conn    => conn,
            p_message => utl_encode.base64_encode(data)
        );
    end loop;
 
/* more plsql after this */
 
With this tiny bit of extra code the invisible BOM characters are inserted at the start of the file and the issue is magically fixed!
 
Not sure if this is really an Oracle or Microsoft issue - i'm blaming Microsoft in this case.......
 


Comments