Digital Sonata's Blog: 2007

Thursday, December 27, 2007

Carabao Language Kit 1.0.0.2 released

Version 1.0.0.2 is now available for download.

Fixed:

Inflection generation problems of TagLemma results (words not in the dictionary) in Carabao MorphoLogic

Added:

Capability to inspect other guesses. For example, in a sequence like "adverb" + "adverb", it is possible to quickly scrap the entire sequence if the second adverb can also be a preposition
Comprehensive morphology of Russian language

Improved:

Sequence descriptions are less cluttered: negative constraint elements (those that do not have an identity) are no longer described
Performance of sequence processing
Accuracy of sequences
Domains

Thursday, November 22, 2007

Carabao Language Kit 1.0.0.1 released

Following the initial feedback and testing results, we made certain changes to the English lexicon, increasing its accuracy. Version 1.0.0.1 is now available for download.

Fixed:

Various validation problems with attached tokens

Lookup windows are no longer maximized on opening

Incorrect tooltips after deletion in the dictionary table

Added:

GUI support for negative constraints in sequences

Handling of irregular 'smart quotes' in Translation Console

Manual disambiguation table in Carabao Linguist Edition

Style tags to the tooltips in the dictionary table

Improved:

Supplied syntactic structures for English

In the translation console, the original thesaurus article is suppressed when the word is part of an idiom - to prevent confusion

Sunday, November 4, 2007

BLEU & statistical MT

If you're into machine translation, you might have noticed that one of the killer arguments in favour of statistical MT is that these systems score higher on NIST evaluations (tests aimed at evaluating the accuracy of the output). NIST has selected IBM's BLEU as its method of measuring accuracy.

Google's statistical translation engine, for example, has beaten all the others. If you have played a bit with the Google Translate BETA pairs (all the others are SYSTRAN, so ignore them) and compared them with more traditional systems, you are probably as surprised as I was. While context detection does, of course, work better (this is what statistical MT is for, anyway), long sentences are garbled, and the grammar is simply hopeless. Sometimes their dictionary harvesters produce a new type of error, confusing proper names in the same category (I personally witnessed Abramovich, a Russian billionaire, being translated as Berezovsky, another Russian billionaire, and Vedomosti, a Russian newspaper, as Yahoo! - probably the original story was created by the former, and translated by the latter without bothering to mention the original). From the human point of view, most statistical MTs are definitely no better (and frequently much worse) than the traditional rule-based ones.

Is it just my impression that BLEU favours statistical MTs for no reason?

Turns out, not really. Eduard Hovy of the University of Southern California published a paper dedicated to this topic: http://www.elra.info/mtsummit2007/Pres-3-Hovy.pdf

To cut a long story short: Prof. Hovy says the reason is that BLEU, just like the statistical MTs, counts exact matches. As rule-based dictionaries are built on broader definitions, their word choices will be counted as errors even when they are absolutely correct. And if a translation uses a different, yet correct, word order, this will be counted as an error too.
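To make the "exact matches" point concrete, here is a minimal sketch of clipped n-gram precision, the core ingredient of BLEU. It is not the NIST scoring tool: the full metric also combines several n-gram orders and applies a brevity penalty. The function names and the example sentences are my own, for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found verbatim in the reference.

    Each n-gram is credited at most as many times as it occurs in the
    reference (the "clipping" step of BLEU).
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Illustrative sentences (assumptions, not taken from any evaluation set).
reference = "the cat sat on the mat"
synonym = "the cat sat on the rug"      # correct, but a different word
reordered = "on the mat the cat sat"    # correct, but a different order

for label, candidate in [("synonym", synonym), ("reordered", reordered)]:
    p1 = clipped_precision(candidate, reference, 1)
    p2 = clipped_precision(candidate, reference, 2)
    print(f"{label}: unigram={p1:.2f} bigram={p2:.2f}")
```

Both candidates are acceptable translations to a human reader, yet the synonym loses a unigram and a bigram, and the reordered sentence loses a bigram at the point where the order changes.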

I'd add another reason: BLEU compares lemmas, and inflections (like morphological case in European languages) frequently play a paramount role in conveying the meaning (for example, the accusative case means a direct object, and the dative case can be translated as an indirect object or a complement; all this is ignored by BLEU). As mentioned above, statistical MTs are not very adept at grammar.

It might be worth adding that BLEU seeks correlation on a corpus level. Statistical MTs learn from corpora as well. Given the limited number of high-quality corpora, I don't think it'd be surprising if both NIST's BLEU evaluator and the system being evaluated learned from the same corpus. So if both harvesters made the same mistake (quite possible), and a rule-based system was correct, guess who will be penalized.

What about NIST, do they address this somehow?

Yes, they do:

http://www.nist.gov/speech/tests/mt/doc/mt06eval_official_results.html
(look for Performance Measurement).

It's not as if BLEU is the only metric in the field. I personally consider METEOR much more promising, especially because it (finally!) takes synonyms into consideration. And, of course, the work never stops, and there are numerous attempts to improve the correlation between human perception and the evaluation mark.

So why was it still used in the 2006 evaluation? I guess the central reason is that BLEU was the first one, and it takes time to change procedures. Also, it might not be so easy to kick IBM's stuff out.
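A toy sketch of why synonym awareness matters: the function below counts unigram matches, optionally accepting a synonym when no exact match is left. This is not METEOR itself (which draws synonyms from WordNet, also matches stems, and combines precision and recall with a penalty for fragmentation); the synonym table and the function name are hypothetical.

```python
# Hypothetical, hand-made synonym table for illustration only.
SYNONYMS = {"rug": {"mat"}, "mat": {"rug"}}


def unigram_matches(candidate, reference, use_synonyms):
    """Count candidate words that can be aligned to a reference word.

    Each reference word is consumed at most once. Exact matches are
    tried first; synonyms are tried only if use_synonyms is True.
    """
    unused = reference.lower().split()
    matched = 0
    for token in candidate.lower().split():
        if token in unused:
            unused.remove(token)
            matched += 1
        elif use_synonyms:
            for alternative in SYNONYMS.get(token, ()):
                if alternative in unused:
                    unused.remove(alternative)
                    matched += 1
                    break
    return matched


reference = "the cat sat on the mat"
candidate = "the cat sat on the rug"

print(unigram_matches(candidate, reference, use_synonyms=False))
print(unigram_matches(candidate, reference, use_synonyms=True))
```

With exact matching only, "rug" is an error; with the synonym table, every word of the candidate is aligned.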

Wednesday, October 31, 2007

Digital Sonata releases Carabao Linguist Edition

Digital Sonata announces the release of Carabao Linguist Edition desktop suite.

Carabao Linguist Edition desktop suite allows users to import bulk data from Carabao Exchange XML files. The entries in the imported dictionaries are matched either by their ID number or using fuzzy comparison with existing entries of any language in the database. In addition, Carabao Linguist Edition contains a module which allows for the transformation of unstructured "paper" dictionaries into machine-readable OLIF XML format.
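The "ID first, fuzzy comparison second" matching order can be illustrated roughly as follows. This is only a sketch under assumptions: the entry fields (`id`, `headword`), the similarity measure (Python's standard `difflib`) and the 0.85 threshold are all hypothetical and do not describe how Carabao actually compares entries.

```python
from difflib import SequenceMatcher


def match_entry(imported, existing, threshold=0.85):
    """Find the existing entry that corresponds to an imported one.

    Step 1: match by ID number, if the imported entry carries one.
    Step 2: otherwise pick the most similar headword, provided the
    similarity reaches the (hypothetical) threshold.
    """
    if imported.get("id") is not None:
        for entry in existing:
            if entry["id"] == imported["id"]:
                return entry

    best, best_score = None, 0.0
    for entry in existing:
        score = SequenceMatcher(
            None, imported["headword"].lower(), entry["headword"].lower()
        ).ratio()
        if score > best_score:
            best, best_score = entry, score
    return best if best_score >= threshold else None


# Hypothetical database content.
existing = [
    {"id": 1, "headword": "colour"},
    {"id": 2, "headword": "carabao"},
]

print(match_entry({"id": 2, "headword": "water buffalo"}, existing))
print(match_entry({"id": None, "headword": "color"}, existing))
print(match_entry({"id": None, "headword": "zebra"}, existing))
```

The first lookup succeeds by ID even though the headwords differ, the second falls back to the fuzzy comparison, and the third finds nothing close enough and returns None.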

Sunday, October 7, 2007

The Babelfish Tartuffe

A+ for the idea.

Mangiare Theatre Company, Ireland, are playing Molière’s classic comedy Tartuffe as translated into English by Babelfish (or SYSTRAN, in other words) at the Dublin Fringe Festival:

http://www.fringefest.com/shows/67

I wish they'd come to Melbourne.

Thursday, October 4, 2007

Pure statistical MT fashion is over?

The media hype surrounding statistical machine translation has been one of the most irritating phenomena of recent years.

Machine translation is one of the most sought-after technologies in today's world, and is incredibly difficult to get right. No wonder the media trumpets every "breakthrough" when a context is correctly extracted for a couple of sentences about Bin Laden. (Well, I can supply these in tons.) For some odd reason, statistical MT has been treated by the press in a special way: no one criticizes the obvious & numerous flaws, but every small advance draws a gasp.

Needless to say, I was extremely pleased to read this:
http://www.multilingual.com/newsDetail.php?id=5917 .

Some background. Language Weaver was created with the idea of pure statistical MT in mind. (So were some other startups, but it is the only one still standing.) Their success obviously inspired Google to snatch one of their main researchers, and both of them are still developing these wonder-engines (Google for at least 3 years).

The gradual shift to rule-based MT is not a capitulation, of course, but can already be regarded as recognition that statistics alone is not a panacea. And, if you have spent some time developing MT, you should know that in a hybrid system which contains both grammatical rules and statistical analysis, rules usually take precedence (because obviously, the input of human developers is trusted more).

The complaints I get from Language Weaver's users are usually that highly inflected languages are completely messed up, and that sometimes, even when the text looks coherent, the translation is simply wrong.

If someone else at Google is supervising the actions of the ex-Language Weaver MT researcher, they will do a similar thing soon. Of course, it will be difficult to detect, but they might issue an announcement about a dramatic improvement in quality (to the point where it can compete with 20-year-old SYSTRAN ;) ), blah-blah-blah.

Monday, September 24, 2007

Getting your COM objects to run on a shared web server

It turns out to be possible to run COM objects without registering them with regsvr32 or the COM+ thingy. Furthermore, this is how we operate our online demos here :-) (of course, we plan to move to a dedicated server ASAP 'cause some are real heavy, especially DeepAnalyzer). Clarion developers: these are pure Clarion COMs.

The secret is to use so-called side-by-side execution (SxS), which has been around since Windows 2000 (or XP), in an ASP.NET environment (yet another reason to migrate to ASP.NET). ASP.NET 1.1 had this activation model as a default; ASP.NET 2.0 had it removed, but there is a workaround: a free (for now!) 3rd party library by MazeComputer taking advantage of ASP.NET "HTTP modules".

A few gotchas:

If you have a separate TLB (not linked inside the DLL), don't rely on MS Studio manifests. Build them yourself. Attached is an example of a working manifest.

Do not use the AspCompat=true directive on that page. Even though it is recommended with COM objects, AspManifests does not work with it. Their support told me it's because the "performance boosters" create disarray between threads (in fact, AspCompat is supposed to limit everything to just one thread (IIRC) - strange).

Probably because of some kind of threading issue with the same library, you need not only to release the object (myObj = Nothing), but also to call the garbage collector (GC.Collect()) after each disposal. Otherwise, you might get crashes here and there.

All the rest works like a Swiss chronometer. Feel free to ask questions if anything is not clear.

Monday, September 17, 2007

Digital Sonata releases Carabao Language Kit 1.0.0.0

Digital Sonata is proud to announce the release of Carabao Language Kit 1.0.0.0.

Carabao is a family of linguistic tools providing the following capabilities:

Sense disambiguation & text understanding

Detailed, sentence by sentence domain extraction

Machine readability evaluation

Automatic translation between languages

Deep morphological analysis and synthesis

Transliteration between scripts

Named entity recognition and classification

Automatic linguistic profiling

Universal measure conversion

The most distinctive feature of Carabao is the engine's complete abstraction from the linguistic point of view. All the linguistic logic resides in a database complete with a powerful GUI data editor. This enables users to tweak, modify, or alter the engine (to the point of adding a new language) in every possible way without writing one line of source code.

In addition to more straightforward purposes, Carabao can be integrated with 3rd party machine translation and other NLP packages (such as OCR or even voice recognition) to improve their accuracy.
