Friday, April 27, 2012
Workaround for conversion of Unicode Vietnamese to ANSI
It looks like the first support for Vietnamese appeared in the 1990s, which means the ANSI codepage was built in a hurry when Unicode was either in the works or already out. The tonal nature of the language demands complex combinations of diacritics, not just regular grave accents, umlauts and such. Simply put, there were not enough slots for the new characters, so the creators of the Vietnamese codepage stuffed them in wherever and however they could.
Today very few people use ANSI; however, it is still needed for several reasons: legacy being one (people still work with mainframes, you know), and compliance another. Of course, there is a tried and true function, WideCharToMultiByte, which works like a Swiss chronometer. That is, for most languages - except Vietnamese. There are posts by Microsoft folks stating that "Vietnamese is a complex language on Windows" (duh!), but not really telling how to fix it. I asked around, and no one replied, as expected (I love how Stack Overflow people react when they can't answer the question :-) ).
After scrutinising the result I saw what the problem was. It seems that the decomposition routine used during the conversion is unable to handle some combinations of Unicode characters. Manually decomposing those characters into their equivalents worked for me.
I use an esoteric language called Clarion to design our tools and some components, so my original code is in Clarion. A few days ago, Mark Jacobs from Critical Research contacted me requesting help with the same issue, and kindly converted my Clarion source code to C++, which is more familiar to the rest of the world. Thanks, Mark!
Get Clarion source code here and Mark's C++ here.
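For readers without Clarion or a Windows toolchain at hand, the manual decomposition described above can be sketched in Python. This is not the Clarion or C++ code linked above; it is a minimal illustration that assumes Python's built-in cp1258 codec stands in for the Windows conversion, and the names to_cp1258 and TONE_MARKS are made up for the example. Each character that has no slot of its own in the codepage is split into a base letter the codepage does contain (such as ê or ơ) plus one of the five combining tone marks it also contains.

```python
import unicodedata

# The five combining tone marks that have their own slots in Windows-1258:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def to_cp1258(text):
    """Encode text as Windows-1258, splitting off tone marks where needed.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    # Start from composed form so every letter is handled as one unit.
    for ch in unicodedata.normalize("NFC", text):
        try:
            # Characters with a direct slot (e.g. a-grave, e-circumflex) go as-is.
            out += ch.encode("cp1258")
            continue
        except UnicodeEncodeError:
            pass
        # Fully decompose, then pull the tone mark away from the rest.
        decomposed = unicodedata.normalize("NFD", ch)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        rest = "".join(c for c in decomposed if c not in TONE_MARKS)
        # Recompose the rest so circumflex, breve and horn stay attached
        # to the vowel, since the codepage has those letters precomposed.
        base = unicodedata.normalize("NFC", rest)
        # Anything still unrepresentable becomes "?" instead of raising.
        out += (base + tones).encode("cp1258", errors="replace")
    return bytes(out)


if __name__ == "__main__":
    sample = "Ti\u1ebfng Vi\u1ec7t"  # "Tiếng Việt"
    encoded = to_cp1258(sample)
    print(encoded)
    # Round trip: decode and recompose to compare with the input.
    print(unicodedata.normalize("NFC", encoded.decode("cp1258")) == sample)
```

Decoding the bytes with cp1258 and normalizing the result to NFC is a handy way to check that nothing was lost on the way.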
Friday, December 10, 2010
LinguaSys - USA Today
Monday, December 6, 2010
ALTA 2010
Vadim Berman will be speaking some time between 3:30pm and 5:30pm.
Monday, June 7, 2010
Digital Sonata Signs Long Term Exclusive Agreement with LinguaSys™ For Use Of Carabao For Machine Translation
Digital Sonata signed a long term deal with LinguaSys™ for the exclusive use of Carabao in machine translation (MT) solutions on March 24, 2010.
Carabao is a hybrid language translation system using both statistical and rules-based methodologies. Vadim Berman, CEO of Digital Sonata in Australia, and Chief Technology Officer and a co-founder of LinguaSys, is the author of Carabao. Berman has a wealth of dedicated experience in the field of MT and text analysis.
Brian Garr, CEO of LinguaSys said, “We are very excited that this incredible technology from Digital Sonata will help us create the next generation of language translation solutions.”
LinguaSys is a new next-generation machine translation company. LinguaSys’ Carabao language middleware uses language processing methodologies offering excellent comprehension in the least amount of time at low cost. LinguaSys enables enterprises to translate volumes of information, including text chat, e-mail, web pages and documents, quickly, accurately and automatically. LinguaSys offers the creation of new MT languages, customized lexical services, ease of use, compatibility with existing natural language software, security behind the firewall, availability, integration and lower memory requirements.
Monday, May 3, 2010
Carabao Language Kit 1.7.0.0 released
Version 1.7.0.0 is now available for download.
Fixed:
- Handling of control priority greater than 2 when some of the members have no feasible agreement graph. As a result, some parts of the sequence worked and some didn't.
- Truncation of very long sentences
Added:
- A utility to validate and correct rule unit values
- A generic support for formatted processing, e.g. HTML, XML, SGML including embedded formatting elements in the text flow
- GUI to test formatted processing in Carabao Test Console
- Automatic conversion of double-byte space characters into standard single-byte spaces
Improved:
- Regular expressions for segmentation into character classes for double-byte languages
- Perl-compatible regular expressions have been introduced for unknown heuristics
- Frequency-based backtracking added to the tokenization algorithm
- Unicode clipboard support in Carabao desktop suites is now bidirectional: when leaving the application and when coming back to the application
Monday, April 19, 2010
Publication in Multilingual
My article on the evaluation of emerging language technologies was published in Multilingual:
http://multilingual.texterity.com/multilingual/20100405?pg=59#pg64
Wednesday, January 27, 2010
English - Swedish OLIF dictionary released
An English - Swedish OLIF dictionary has been added to the list of OLIF lexicons distributed by Digital Sonata. The dictionary is available for download from http://www.digitalsonata.com/download.aspx?type=linguisticData.