Friday, March 14, 2008

Hype, hype, hype

Got this article from TAUS today.

Man, is this annoying or what. With claims like these, the MT industry is going to move from Web 2.0 to Dot Com Bubble Burst 2.0.

I do hold high respect for Asia Online's team. Philipp Koehn co-authored some papers with Franz Och, the guy behind Language Weaver and Google Translate. Dion Wiggins has some super-duper credentials as an IT businessman, although with no experience at all in MT or NLP (and this industry is very different).

From the technological point of view, Koehn did the right thing (IMHO) by going in the hybrid direction instead of looking for the philosopher's stone of a pure, ideal engine that builds itself and, by some kind of divination, deduces stuff that hand-crafted rules can't. It is also refreshing to see that they decided to use a high-quality corpus.

But - c'mon, people!

The article contains odd bits such as:

...adding a syntax component to a SMT system "seriously degrades throughput performance from 5,000 words a minute to only around 300" on a machine with 4 high speed CPUs

Did they mean the SMT system? Because how can you measure speed of all SMT systems?

If this is the SMT system (i.e. they already built the kernel), this probably means they are going to use Moses. Which calls for more questions. Koehn created Pharaoh and Moses; both are open-source; both have been around for a while; yet I never heard of a commercial or even semi-commercial application that uses them. And I know at least one huge translation agency that launched a project to create their own MTs based on Moses.

There is also EuroMatrix, an all-you-can-spend research project in which Koehn also took part, and which, just like all the other euro-science-charity projects, produced nothing (yeah, OK, there is a Czech-English lexicon, which is a huge deal, right?).

The website of Asia Online looks nice. Obviously, lots of Wikipedia articles, SMT for dummies complete with scientific-looking formulas, media buzz. Not a word about the actual engine, no screenshots, no demos.

One thing that particularly captured my attention was the Careers page. As of now, they are looking for:

  • country office managers in every Asian country, including Thailand

  • "content procurement" managers that read content and make sure it is well-translated (good luck with that) before feeding it to the lexicon builders

  • the coolest thing: programmers in Thailand: C++ - probably to fix bugs in Moses, and C# to hack the front-end. Now this last part is kind of OK, except that one of the requirements is "Have database skills in MS SQL Server, Oracle Database and etc". Data and stuff. In other words, at this stage they have not yet decided what backend they are going to use. Which means there are no design specs.

So they have no people responsible for the content. They have no design specs. They have no solid plan. They have no people who worked and produced anything of this class.

They did not even research their markets properly. If they even looked at Wikipedia articles about the Philippines, or talked to one or two Filipinos, they'd learn that English, not Tagalog, is the lingua franca of the Philippines when it comes to written language. 90% of the major newspapers, all the official correspondence, all the commercial documents in the Philippines are in English.

All they have is the prototype of a system that was never shown to work in a real-world environment, and a bunch of British guys who rented an office in "beautiful downtown Bangkok" and announced that they will conquer the world in a year.

And, of course, the usual reservations about statistical MT apply.


dionwiggins said...

Dear Vadim,

In response to your blog post I would like to address some of the issues you have raised. You are correct in some respects. There are the foundations of Philipp Koehn's and others' work in our systems. It does not make sense to reinvent the wheel. Asia Online's engineers have taken elements of previous work that has been developed and added significant functionality. Asia Online has a team of over 150 staff working on building out our engines in many languages, and we are recruiting to find more. Using these resources, we have gathered significant amounts of corpus and have added many features such as multiple-level domain support. Ongoing work is underway in the areas of syntax trees and other approaches. It is that work that we are referencing when we quote such performance metrics.

The translation industry is wary of hype and we will demonstrate very soon our capabilities rather than just talk about them. In terms of providing access to our technology, we will be doing so in the near future. We are preparing to launch now. The LISA website has some screenshots and other examples from a presentation I delivered in Beijing earlier this month. We have been very quiet up until the LISA conference; we are now ready to talk about our systems. Some of our upcoming announcements and launches will further validate our company, business plan and platform.

In other areas, your assumptions are incorrect. We deal with content that comes from many platforms. As such we need to work with those platforms. Being able to integrate with Oracle, SQL Server, MySQL etc. is a requirement in order to be able to access and integrate that content into our systems.

In terms of market research, we have done extensive market research. I have personally lived in the Philippines for a number of years and my former wife is from the Philippines. You are not familiar with the details of our business plan or who we intend to address and at this time I am not willing to fully disclose all our plans. You do not play a hand of cards by showing what you are holding before you bet. In terms of the Philippines, many in the Philippines can speak, read and write English. The skill level varies. In Manila it is reasonably high, especially among the younger generation. But there are many who are not skilled in English or cannot speak it at all. The language of business may be English, but the language of the nation is Tagalog, in addition to a number of other dialects. If you go to a city like Cagayan De Oro, Davao, General Santos or similar, you will find the use of English significantly less mature. I speak Tagalog to a certain degree – mostly learnt from relatives and friends whose English was not good enough to communicate properly with me.

I suggest a little fact checking of your own before making "factual" references such as "all they have is the prototype of a system" or "a bunch of British guys" or "which means, there are no design specs". You will find that I am from New Zealand, Philipp is from Germany, Greg is from the USA. We do have one "British guy", but we also have 150 staff. We are well beyond prototype. We have real systems and are preparing to launch some of our technology to the public in the coming weeks. Also, a project as large as ours does require design specs. We have worked very hard on our systems and you will see the results soon enough.

In the meantime, if you would like to see this for yourself, I invite you to come and visit us here in Bangkok. We would welcome your review and appreciate the opportunity to present our systems and rectify with facts and demonstrations the misconceptions you may have as to what our systems are capable of and the state of our company, corpus and engines.


Dion Wiggins

Chief Executive Officer

Asia Online

Vadim Berman said...

Dear Dion,

Thank you for your extensive comment. I probably should apologise for the tone of the post; it's just that I'm confident that hype never does any good to anyone.

Having 150 employees (developers?) is remarkable for an MT company. I think even SYSTRAN's R&D team is significantly smaller.

I noticed that you added a demo to your website. This definitely adds credibility, beta or not.

Apologies about the "British guys" part; it's the educational background of your team that made me assume you all were born in the UK.

I have to clarify the Philippines part. I was talking about the WRITTEN language. As you are not constructing a speech-to-speech system, this is probably what is relevant to you.

As you probably can deduce from the names of our products, I also have strong ties to the Philippines. I have never been to the south which you describe, but for the rest:

* Manila is actually more of a "linguistic mixture" part, where Tagalog originates from (and this is why it was adopted in the first place, along with Manila's economic weight). Very frequently the Tagalog is, actually, "Taglish", which is when English and Tagalog words are used in the same sentence.

* Many people in Luzon use either Ilocano, local Igorot dialects, or English as native (yes, native) tongues

* Bisayans, who are the majority population-wise, strongly prefer English to Tagalog, even when their English is a bit pidgin. In fact, they have their own Bisayan version of the national anthem.

* Finally, no business document, no legal document, and very few newspapers are in Tagalog

* Over 90% of Filipino bloggers do not blog in Tagalog

* Visit this website: . Switch to "Filipino". Note that the articles are still in English - which means that nobody cares to translate the content... Visit this: , this is where the people have their say.

The linguistic difficulties of your relatives and friends relate to the SPOKEN language. That's where the great divide lies; the TV is often in Tagalog (except for American movies which are ALWAYS left un-translated), the newspapers, the documents, even the signs are in English. Weird, huh?

I guess the question "do they speak English" in the case of MT systems should be "do they READ English".

Thank you very much for your invitation. I replied privately to the email.