Friday, March 14, 2008

Hype, hype, hype

Got this article from TAUS today.

Man, is this annoying or what. With claims like these, the MT industry is going to move from Web 2.0 to Dot Com Bubble Burst 2.0.

I do hold high respect for Asia Online's team. Philipp Koehn co-authored some papers with Franz Och, the guy behind Language Weaver and Google Translate. Dion Wiggins has some super-duper credentials as an IT businessman, although he has no experience at all in MT or NLP (and this industry is very different).

From the technological point of view, Koehn did the right thing (IMHO) going in the hybrid direction instead of looking for the philosopher's stone of a pure, ideal engine that builds itself and, by some kind of divination, deduces stuff that hand-crafted rules can't. It is also refreshing to see that they decided to use a high-quality corpus.

But - c'mon, people!

The article contains odd bits such as:

...adding a syntax component to a SMT system "seriously degrades throughput performance from 5,000 words a minute to only around 300" on a machine with 4 high speed CPUs

Did they mean the SMT system? Because how can you measure the speed of all SMT systems?
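
Just to put the quoted numbers in perspective, here is a quick back-of-the-envelope sketch. The two throughput figures come from the article; the job size of one million words is purely my own assumption for illustration, and so are the function names.

```python
# Throughput figures as quoted in the article (words per minute).
BASELINE_WPM = 5000
WITH_SYNTAX_WPM = 300

# Hypothetical job size, chosen only for illustration (not from the article).
JOB_WORDS = 1_000_000


def slowdown(before_wpm, after_wpm):
    """Return how many times slower the system gets."""
    return before_wpm / after_wpm


def hours_to_translate(words, wpm):
    """Return the hours needed to process `words` at `wpm` words per minute."""
    return words / wpm / 60


if __name__ == "__main__":
    print(f"slowdown: {slowdown(BASELINE_WPM, WITH_SYNTAX_WPM):.1f}x")
    print(f"without syntax: {hours_to_translate(JOB_WORDS, BASELINE_WPM):.1f} h")
    print(f"with syntax: {hours_to_translate(JOB_WORDS, WITH_SYNTAX_WPM):.1f} h")
```

That is a slowdown of roughly 17 times, if the quoted figures describe the same system on the same hardware, which the article does not make clear.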

If this is the SMT system (i.e. they already built the kernel), this probably means they are going to use Moses. Which calls for more questions. Koehn created Pharaoh and Moses; both are open-source; both have been around for a while; yet I have never heard of a commercial or even semi-commercial application that uses them. And I know at least one huge translation agency that launched a project to create their own MTs based on Moses.

There is also EuroMatrix, an all-you-can-spend research project, in which Koehn also took part, and which, just like all the other euro-science-charity projects, produced nothing (yeah, OK, there is a Czech-English lexicon, which is a huge deal, right?).

The website of Asia Online looks nice. Obviously, lots of Wikipedia articles, SMT for dummies complete with scientific-looking formulas, media buzz. Not a word about the actual engine, no screenshots, no demos.

One thing that particularly captured my attention was the Careers page. As of now, they are looking for:

  • country office managers in every Asian country, including Thailand

  • "content procurement" managers that read content and make sure it is well-translated (good luck with that) before feeding it to the lexicon builders

  • the coolest thing: programmers in Thailand: C++ - probably to fix bugs in Moses, and C# to hack the front-end. Now this last part is kind of OK, except that one of the requirements is "Have database skills in MS SQL Server, Oracle Database and etc". Data and stuff. In other words, at this stage they have not yet decided what backend they are going to use. Which means there are no design specs.

So they have no people responsible for the content. They have no design specs. They have no solid plan. They have no people who have worked on and produced anything of this class.

They did not even research their markets properly. If they even looked at Wikipedia articles about the Philippines, or talked to one or two Filipinos, they'd learn that English, not Tagalog, is the lingua franca of the Philippines when it comes to written language. 90% of the major newspapers, all the official correspondence, all the commercial documents in the Philippines are in English.

All they have is the prototype of a system that was never shown to work in a real-world environment, and a bunch of British guys who rented an office in "beautiful downtown Bangkok" and announced that they will conquer the world in a year.

And, of course, the usual reservations about statistical MT apply.

Tuesday, March 11, 2008

Carabao Language Kit released

The version is now available for download - mostly to fix the regressions reported in


  • Crash when using sequence extraction option (regression from


  • Capability to import sequences by data entry directly from the Sequence Sheet

  • Capability to manually set sequence descriptions
  • Some sequences for multi-word entity extraction
  • More morphological exceptions for Russian


  • Processing speed and memory consumption - further boost

  • Token Sheet (words & sequences) GUI

Monday, March 3, 2008

What may happen in the next 100 years - predictions from 1900

A scan of an interesting article from December 1900's Ladies' Home Journal: What May Happen in the Next 100 Years

Some of the predictions are astonishingly accurate. Some are quite funny.
There's a prediction about language:

There will be No C, X or Q in our every-day alphabet. They will be abandoned because unnecessary. Spelling by sound will have been adopted, first by the newspapers. English will be a language of condensed words expressing condensed ideas, and will be more extensively spoken than any other. Russian will rank second.

Of course, this was written before the two world wars, which introduced corrections into the statistics.

Well done with the "condensed words" and "condensed ideas", and the propagation of English (definitely not a given in 1900).