Friday, April 27, 2012

Workaround for conversion of Unicode Vietnamese to ANSI

A few months ago, we had an interesting project that involved Vietnamese. In the course of the project, we hit a minor snag.

It looks like the first support for Vietnamese appeared in the 1990s, which means the Vietnamese ANSI codepage (Windows-1258) was built in a hurry when Unicode was either in the works or already out. The tonal nature of the language demands complex combinations of diacritics, not just regular grave accents, umlauts and such. Simply put, there were not enough slots for all the new characters, so the creators of the Vietnamese codepage stuffed them in wherever and however they could.

Today very few people use ANSI. However, it is still needed for several reasons: legacy being one (people still work with mainframes, you know), and compliance another. Of course, there is a tried and true function, WideCharToMultiByte, which works like a Swiss chronometer. That is, for most languages - except Vietnamese. There are posts by Microsoft folks stating that "Vietnamese is a complex language on Windows" (duh!), but not really telling how to fix it. I asked around, and no one replied, as expected (I love how Stack Overflow people react when they can't answer the question :-) ).

After scrutinising the output, I saw what the problem was. It seems that the decomposition routine used during the conversion is unable to handle some combinations of Unicode characters. Manually decomposing those characters into their equivalents worked for me.
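
To illustrate the idea of manual decomposition, here is a minimal sketch in Python. It is not the Clarion or C++ code mentioned in this post, and it does not call WideCharToMultiByte; it relies on Python's built-in cp1258 codec instead. The function name to_cp1258 and the TONE_MARKS set are my own hypothetical names, and the approach assumes that the five Vietnamese tone marks are the only combining characters the codepage can store as separate bytes.

```python
import unicodedata

# Assumption: these five combining tone marks (grave, acute, tilde,
# hook above, dot below) are the ones cp1258 stores as separate bytes.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def to_cp1258(text: str) -> bytes:
    """Hypothetical helper: encode Vietnamese text to Windows-1258,
    splitting off the tone mark when a letter has no single-byte slot."""
    out = bytearray()
    # Start from precomposed characters so each letter is one code point.
    for ch in unicodedata.normalize("NFC", text):
        try:
            out += ch.encode("cp1258")
            continue
        except UnicodeEncodeError:
            pass
        # Fully decompose, then pull the tone mark away from the rest.
        decomposed = unicodedata.normalize("NFD", ch)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        rest = "".join(c for c in decomposed if c not in TONE_MARKS)
        # Recompose the letter with its circumflex, breve or horn.
        base = unicodedata.normalize("NFC", rest)
        # Raises UnicodeEncodeError if the character is still unmappable.
        out += (base + tones).encode("cp1258")
    return bytes(out)
```

For example, to_cp1258("Việt") keeps "ê" as a single byte and emits the dot below as a separate combining byte after it. Decoding the result with cp1258 and normalising to NFC should give back the original text.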

I use an esoteric language called Clarion to design our tools and some components, so my original code is in Clarion. A few days ago, Mark Jacobs from Critical Research contacted me requesting help with the same issue, and kindly converted my Clarion source code to C++, which is more familiar to the rest of the world. Thanks, Mark!

Get the Clarion source code here and Mark's C++ here.

1 comment:

Vadim Berman said...

The long-forgotten post was suddenly noticed by Michael Kaplan of Microsoft, whose spirit I was trying to summon before resorting to this fix.

Setting Michael's puzzling (let's stick to euphemisms) attitude aside, I would like to post my reply - nothing shows up in Michael's blog when I comment directly.

If Michael had paid attention to the title and the rest of the article, he'd have noticed that the fix covers only one direction: from Unicode to ANSI.

From what I have seen, there is no issue with the opposite direction (and strictly speaking, I did not need it).

For some reason, I don't see it as my personal duty to fix Microsoft's old code. I only fix what I need, Michael. Sorry about not doing your job for you entirely.

Of course, there is also the fact that the C++ code is not even mine (see the last paragraph of this post), but as I understand it, it works - just like my Clarion code.

I cannot comment on the scalability claim; maybe Michael meant memory allocation issues, probably in the C++ code.