Thursday, October 4, 2007

Is the pure statistical MT fashion over?

The media hype surrounding statistical machine translation has been one of the most irritating phenomena of recent years.

Machine translation is one of the most sought-after technologies in today's world, and it is incredibly difficult to get right. No wonder the media trumpets every "breakthrough" when the context is correctly extracted for a couple of sentences about Bin Laden. (Well, I can supply these by the ton.) For some odd reason, the press has treated statistical MT in a special way: no one criticizes its obvious and numerous flaws, but every small advance draws a gasp.

Needless to say, I was extremely pleased to read this: .

Some background. Language Weaver was created with the idea of pure statistical MT in mind. (So were some other startups, but it is the only one still standing.) Their success obviously inspired Google to snatch one of their main researchers, and both companies are still developing these wonder-engines (Google for at least 3 years).

The gradual shift to rule-based MT is not a capitulation, of course, but it can already be regarded as recognition that statistics alone is not a panacea. And if you have spent some time developing MT, you will know that in a hybrid system which contains both grammatical rules and statistical analysis, the rules usually take precedence (because, obviously, the input of human developers is trusted more).
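To make the "rules take precedence" point concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: the names `RULES`, `PHRASE_TABLE`, `rule_based`, `statistical` and `hybrid_translate`, the German examples, and the scores. It is not how Language Weaver, Google or SYSTRAN actually work; real systems interleave the two approaches in far more complicated ways.

```python
from typing import Optional

# Hypothetical hand-written rules: trusted, but covering very little.
RULES = {
    "guten morgen": "good morning",
}

# Hypothetical phrase table with made-up scores. Note that the top-scoring
# candidate for "guten morgen" is deliberately wrong.
PHRASE_TABLE = {
    "guten morgen": [("good tomorrow", 0.6), ("good morning", 0.4)],
    "danke": [("thanks", 0.6), ("thank you", 0.4)],
}


def rule_based(source: str) -> Optional[str]:
    """Return a translation only if a hand-written rule covers the input."""
    return RULES.get(source.lower())


def statistical(source: str) -> Optional[str]:
    """Return the highest-scoring candidate from the phrase table, if any."""
    candidates = PHRASE_TABLE.get(source.lower())
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


def hybrid_translate(source: str) -> str:
    """Try the rules first; fall back to statistics only when no rule applies."""
    for engine in (rule_based, statistical):
        result = engine(source)
        if result is not None:
            return result
    # Neither component knows the input: pass it through untranslated.
    return source
```

In this toy, `hybrid_translate("Guten Morgen")` takes the rule's answer even though the statistical component on its own would prefer "good tomorrow", while `hybrid_translate("danke")` falls through to the statistical guess because no rule covers it.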

The complaints I get from Language Weaver's users are usually that highly inflected languages are completely messed up, and that sometimes, even when the text looks coherent, the translation is simply wrong.

If someone else at Google is supervising the actions of the ex-Language Weaver MT researcher, they will do a similar thing soon. Of course, it will be difficult to detect, but they might issue an announcement about a dramatic improvement in quality (to the point where it can compete with 20-year-old SYSTRAN ;) ), blah-blah-blah.

