Monday, December 8, 2008

Carabao Language Kit 1.2.3.0 released

The version 1.2.3.0 is now available for download.

Fixed:

  • Handling of single quotes as syntax delimiters in English

Added:

  • A segmentation mode that handles languages written without white space (e.g. Chinese, Japanese, Korean, Thai) more effectively. In this mode, runs of different character classes are broken into separate tokens (e.g. Chinese text followed immediately by English). Any remaining unidentified part is run through the unknown heuristic identifier.

  • Automatic conversion of Unicode clipboard data into the currently active encoding in the tokens table

  • Better warning when attempting to overwrite the current token

  • A utility to rebuild the semantic links cache

Improved:

  • On some systems, every update of the table of tokens added a new set of system icons (minimize, restore, maximize) to the MDI frame window. The maximize option now resizes the window to roughly the full client area, without switching it to maximized mode

Monday, September 8, 2008

Free source code section

We added a small source code section on our Download page, where we will post freebies for developers.

Carabao Language Kit 1.2.0.0 released

The version 1.2.0.0 is now available for download.

Fixed:

  • Unknown patterns were translated as hypernyms

  • Regression: certain category-based sequences were omitted on second execution because of a malfunctioning guess scan caching mechanism

  • In analytical mode (Carabao DeepAnalyzer), there was a mismatch between word index number and an idiom member index, in sentences with attached tokens such as 'em, 'm

  • When copying a token with one rule unit or fewer, the text was always reset to the original

Added:

  • Capability to match numbers as patterns

  • When a translation is not found, the engine tries to fall back to a matching hypernym instead

  • New methods to Carabao DeepAnalyzer that enable accessing the members of the detected idioms

  • New methods to Carabao CDA that enable accessing the unknown heuristics table

  • New sequences

  • Russian morphological exceptions
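
The hypernym fallback listed above can be pictured with a short Python sketch. The dictionaries `TRANSLATIONS` and `HYPERNYMS`, the function `translate`, and the decision to walk up more than one level are all hypothetical; the release notes only say that the engine falls back to a matching hypernym when no translation is found.

```python
# Hypothetical toy data: an English-to-Spanish lexicon and a hypernym table.
TRANSLATIONS = {"dog": "perro", "animal": "animal"}
HYPERNYMS = {"poodle": "dog", "dog": "animal"}


def translate(word, max_depth=5):
    """Return a translation of word, falling back to its hypernyms.

    max_depth limits how far up the chain we climb, which also guards
    against accidental cycles in the hypernym table.
    """
    current = word
    for _ in range(max_depth + 1):
        if current in TRANSLATIONS:
            return TRANSLATIONS[current]
        current = HYPERNYMS.get(current)
        if current is None:
            return None
    return None


print(translate("poodle"))  # no entry for "poodle", so "dog" is used
```

The fallback trades precision for coverage: the reader gets a more general word in place of an untranslated one.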

Improved:

  • If an "unknown pattern" is forced to match a known word, it will not create a new guess if a guess with the same hypernym already exists. For example, if you force a check of whether a known word can be a city, no new record is created if there is already a guess with a known city
  • Automatic input language switching in locator fields
  • Locator fields are pre-filled with the list of all existing languages in the database, eliminating the need to jump to the next language

Thursday, June 5, 2008

The surreal world of VoIP

We're currently working on a massive project involving telephony and mobile technologies, and I had to look for VoIP vendors to cater for the relatively simple needs of my client. I have to say that while the needs are simple, the traffic is extremely high, so in monetary terms, it could be a nice deal for the VoIP vendors.

But... is VoIP a strange industry or what? I don't know who runs all these companies - but:
  • if you don't know what IVR, DID, PSTN, or all their other "secret handshakes" mean - they won't even talk to you. Forget about the forums, they are even less helpful than Usenet.
  • the responses usually come after weeks, and they are of the type "I just transferred your inquiry to our sales representative". Obviously, unless you kick and scream, the sales rep won't get back to you
  • you have to re-tell the story over and over again

Ah yes, but the internet is full of dotcomish optimism about mashups and other kewl stuff. Awesome, dude.

Of course, there are also people who do need paying customers, and those who are able to concentrate, and - surprise surprise - they were the ones who got the job eventually.

Thursday, May 1, 2008

Published in MultiLingual

My article about real-world applications of machine translation has been published in MultiLingual Computing, the leading industry magazine for globalization, international software development and language technology.

This part is for subscribers only though.

Wednesday, April 23, 2008

We are published at ELRA

After a few months of evaluations, agreements, and inspections, our linguistic data is published on the European Language Resources Association's website. The Russian - English OLIF dictionary is sold at quite a price, while the Swahili, Czech and Cebuano dictionaries are distributed for free (although ELRA charges for postage and media).

It is important to mention that all this data can be created from (usually free) ASCII dictionaries on the net using Carabao Linguist Edition.

Clarification: OLIF is the Open Lexicon Interchange Format, backed by SAP and created especially for NLP-oriented lexica. The official website is www.olif.net.

Tuesday, April 1, 2008

Server transition

We just moved to a new server. Much better performance, but there might be some minor technical glitches in the next few days. Thank you for your patience.

Friday, March 14, 2008

Hype, hype, hype

Got this article from TAUS today.

Man, is this annoying or what. With claims like these, the MT industry is going to move from Web 2.0 to Dot Com Bubble Burst 2.0.

I do hold high respect for Asia Online's team. Philipp Koehn co-authored some papers with Franz Och, the guy behind Language Weaver and Google Translate. Dion Wiggins has some super-duper credentials as an IT businessman, although with no experience at all in MT or NLP (and this industry is very different).

From the technological point of view, Koehn did the right thing (IMHO) going in the hybrid direction instead of looking for the philosopher's stone of a pure, ideal engine that builds itself and, with some kind of divination, deduces stuff that hand-crafted rules can't. It is also refreshing to see that they decided to use a high quality corpus.

But - c'mon, people!

The article contains odd bits such as:

...adding a syntax component to a SMT system "seriously degrades throughput performance from 5,000 words a minute to only around 300" on a machine with 4 high speed CPUs

Did they mean the SMT system? Because how can you measure the speed of all SMT systems?

If this is the SMT system (i.e. they already built the kernel), this probably means they are going to use Moses. Which calls for more questions. Koehn created Pharaoh and Moses; both are open-source; both have been around for a while; yet I never heard of a commercial or even semi-commercial application that uses them. And I know at least one huge translation agency that launched a project to create their own MTs based on Moses.

There is also EuroMatrix, an all-you-can-spend research project, where Koehn also took part, and which, just like all the other euro-science-charity projects, produced nothing (yeah, OK, there is a Czech English lexicon, which is a huge deal, right?).

The website of Asia Online looks nice. Obviously, lots of Wikipedia articles, SMT for dummies complete with scientifically-looking formulas, media buzz. Not a word about the actual engine, no screenshots, no demos.

One thing that particularly captured my attention was the Careers page. As of now, they are looking for:



  • country office managers in every Asian country, including Thailand

  • "content procurement" managers that read content and make sure it is well-translated (good luck with that) before feeding it to the lexicon builders

  • the coolest thing: programmers in Thailand: C++ - probably to fix bugs in Moses, and C# to hack the front-end. Now this last part is kind of OK, except that one of the requirements is "Have database skills in MS SQL Server, Oracle Database and etc". Data and stuff. In other words, at this stage they have not yet decided what backend they are going to use. Which means, there are no design specs.


So they have no people responsible for the content. They have no design specs. They have no solid plan. They have no people who worked and produced anything of this class.

They did not even research their markets properly. If they even looked at Wikipedia articles about the Philippines, or talked to one or two Filipinos, they'd learn that English, not Tagalog, is the lingua franca of the Philippines when it comes to written language. 90% of the major newspapers, all the official correspondence, all the commercial documents in the Philippines are in English.

All they have is the prototype of a system that was never shown to work in a real-world environment, and a bunch of British guys who rented an office in "beautiful downtown Bangkok" and announced that they will conquer the world in a year.

And, of course, the usual reservations about statistical MT apply.

Tuesday, March 11, 2008

Carabao Language Kit 1.1.0.1 released

The version 1.1.0.1 is now available for download - mostly to fix the regressions reported in 1.1.0.0.

Fixed:



  • Crash when using sequence extraction option (regression from 1.1.0.0)


Added:



  • Capability to import sequences by data entry directly from the Sequence Sheet

  • Capability to manually set sequence descriptions
  • Some sequences for multi-word entity extraction
  • More morphological exceptions for Russian


Improved:



  • Processing speed and memory consumption - further boost

  • Token Sheet (words & sequences) GUI

Monday, March 3, 2008

What may happen in the next 100 years - predictions from 1900

A scan of an interesting article from December 1900's Ladies' Home Journal: What May Happen in the Next 100 Years

Some of the predictions are astonishingly accurate. Some are quite funny.
There's a prediction about language:

There will be No C, X or Q in our every-day alphabet. They will be abandoned because unnecessary. Spelling by sound will have been adopted, first by the newspapers. English will be a language of condensed words expressing condensed ideas, and will be more extensively spoken than any other. Russian will rank second.


Of course, it was before two world wars, which introduced corrections into the statistics.

Well done with the "condensed words" and "condensed ideas", and the propagation of English (definitely not a given in 1900).

Thursday, February 28, 2008

Carabao Language Kit 1.1.0.0 released

The version 1.1.0.0 is now available for download.

Fixed:

  • Volatility of newly assigned rule units in late sequences
  • Inconsistencies in the generation of inflected forms in design time

Added:

  • Friendly GUI of meta-rules such as lemmatized forms and generation of inflected forms
  • MorphoLogic now inspects the design time data generation meta-rules when generating inflected forms

Improved:

  • Processing speed and memory consumption
  • Increased maximum length of the meta-rule content field
  • Increased some fields to accommodate large sequences and a lot of grammatical data
  • Concurrency during long processing
NOTE: if you are upgrading from 1.0 and would like to keep your data, please run the convertTo11.exe executable on your data.

Sunday, February 24, 2008

Our products are now available at ComponentSource

ComponentSource, the largest online reseller of software components, is now selling Carabao DeepAnalyzer, with Carabao MorphoLogic and Carabao Translation Server on the way. Here is a direct link to our page:

http://www.componentsource.com/features/digital-sonata/index.html

It took us a while (over 2 months) to sign up, with all the checks, examinations, questions, and reviews.

ComponentSource provides the corporate customers a more convenient mode of purchase, compliant with their supply chain procedures, and establishes higher visibility for our products.

Why is MT so formal?

I came across an interesting discussion about machine translation on LinkedIn:

http://www.linkedin.com/answers/international/internationalization-localization/INT_INZ/172005-2191793

Among the obvious stuff (obligatory "spirit is willing, flesh is weak", "out of sight, out of mind" quotes and recommendations from professional translators to hire professional translators instead), there was one curious comment that machine translation is "unnecessary formal".

Brushing aside the exaggerated expectations (you don't expect your computer to have a Jerry Seinfeld inside, do you?), I now recall that when I myself first encountered MT software (it was PARS in the early 1990s), what struck me was the unnecessarily formal style of the output (OK, nowadays they also have SMT, which produces the "porridge o' words" style).

Really, why does it have to be so formal?

If you fly often, and it's usually not business class, then you have probably developed a strong aversion to airplane food and collocations like "sky chefs". While the food is usually well-preserved and reasonably fresh, it rarely tastes like real food with real flavour. I love spicy food and I frequently fly Asian airlines, but I never got to taste real spice there. Aside from the safety concerns (sick people on the plane don't really make it fly faster), I think the reason is that they are aiming for the bland, politically correct, acceptable-to-everyone average. No one gets offended. No one gets hurt.

It is the same with MT. The formality is not a product of technical limitations. It is possible to implement all styles even in older generation systems, but it is more difficult to maintain them. So essentially, the developer needs to pick one style. And if it has to be one style, the best bet is for a bland, politically correct, formal language.

And just like in the case of the airlines that do not set the goal of providing a unique culinary experience, no MT system ever promised to produce a literary masterpiece. Just as you pay the airlines to get you from point A to point B, you use MT to get your "cargo" from a source language to a target language, with as little damage to the wares as possible.

Wednesday, January 23, 2008

Carabao Language Kit 1.0.0.3 released

The version 1.0.0.3 is now available for download.

Fixed:

  • Various tagging problems
  • A bug with mid-sentence sequences priority setting
  • Generation of lemmas from the canonic form for tagging-only affixes

Added:

  • A button to tag new entries morphologically
  • A handful of commonly used business entities (e.g., address, phone, fax, business hours)

Improved:

  • Accuracy of some sequences
  • Domains

Monday, January 14, 2008

A discussion on sentiment / opinion extraction

An interesting discussion about sentiment / opinion extraction in Yahoo! TextAnalytics group, initiated by Seth Grimes:

http://tech.groups.yahoo.com/group/TextAnalytics/message/204

I get mentioned somewhere in the middle (look for Digital Sonata):
http://www.b-eye-network.com/view/6744 - with exactly the opposite of what I said :-) .

Monday, January 7, 2008

On the importance of localization

Do you have a Facebook account? If your native tongue is English, chances are that you do. Otherwise, it is far, far from certain.

The geographic distribution of the social network users varies greatly from network to network. Orkut is an Indian / Brazilian / Pakistani domain; Friendster is Filipino / Chinese; Facebook is mostly used in Anglophone countries. In addition to the more elaborate reasons, such as an average mindset in a particular country (e.g., some are more after pictures, others like to argue and write essays, etc.) - there is one very simple reason. The users either have to strain themselves to use a non-native GUI, or are simply unable to use it. Believe it or not, an average person on the planet earth is only fluent in one language, especially if his/her native tongue is one of the widespread ones.

Recently I witnessed something really cool. A small local social network took a dead grip on a huge market of 300 million or so. Facebook, along with the other giants, seems to be oblivious of this.

Russians, or rather people associated with the ex-USSR, usually do not possess good English skills. Russian is still the lingua franca throughout the entire ex-USSR space, and is the preferred medium of communication among millions of migrants from there.

So when Odnoklassniki.ru, a small website that lets people get in touch with their ex-classmates, was launched (the literal translation of the word "odnoklassniki" is "classmates"), it seemed to land on the right spot. Nobody in the ex-USSR used Facebook. It would be equally absurd to expect the Russians to use Baidu, the main Chinese search engine. Odnoklassniki.ru grew at a rate which is hard to believe.

I got an invitation from an ex-classmate from my primary school in Moscow. I usually discard invitations to social networks; there are already too many to keep track of. Just out of curiosity I went there, and discovered 4 or 5 classmates I hadn't seen in 15 years or so, among about 300 ex-students of the same school. That was two months ago. Now over 2,000 users are associated with this school. (Obviously, it doesn't even exist on Facebook.) As of now, the majority of the people I wanted to reconnect with are there. The website is technically average (makes my Opera hang sometimes) and has too many ads, but it is unlikely I'll ever leave it for anything else. It simply does the job.

The only problem I have is that when I give the URL of my photo album in Flickr, very few are able to use it. Why? You guessed it right: because Flickr does not have a localized version for Russian (despite the funky hello messages in 150 languages), so even navigating through the registration page is too difficult for most.

So I wonder, will they get it? I say, no. Which means, there will be a Chinese Facebook, a Japanese Facebook, and a French Facebook, and there is absolutely no chance these guys will recapture the non-Anglophone markets.