If you're into machine translation, you might have noticed that one of the killer arguments in favour of statistical MT is that these systems score higher on NIST evaluations (tests designed to measure the accuracy of the output). NIST selected IBM's BLEU as its method of measuring accuracy.
Google's statistical translation engine, for example, has beaten all the others. If you have played a bit with the Google Translate BETA pairs (all the others are SYSTRAN, so ignore them) and compared them with more traditional systems, you are probably as surprised as I was. While context detection does, of course, work better (this is what statistical MT is for, anyway), long sentences are garbled, and the grammar is simply hopeless. Sometimes their dictionary harvesters produce a new type of error, confusing proper names in the same category (I personally witnessed Abramovich, a Russian billionaire, being translated as Berezovsky, another Russian billionaire, and Vedomosti, a Russian newspaper, as Yahoo!; probably the original story was created by the former and translated by the latter without bothering to mention the original). From the human point of view, most statistical MTs are definitely no better (and frequently much worse) than the traditional rule-based ones.
Is it just my impression that BLEU favours statistical MTs for no reason?
Turns out, not really. Eduard Hovy of the University of Southern California published a paper dedicated to this topic: http://www.elra.info/mtsummit2007/Pres-3-Hovy.pdf
To cut a long story short: Prof. Hovy says the reason is that BLEU, just like the statistical MTs, counts exact matches. As rule-based dictionaries are built on broader definitions, their word choices are counted as errors even when they are absolutely correct. And if the word order is different, yet correct, this is counted as an error too.
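To make the exact-match point concrete, here is a minimal sketch of the clipped n-gram precision that sits at the core of BLEU. It is not the full metric (no brevity penalty, no averaging over n-gram orders, a single reference only), and the function names and example sentences are my own, chosen purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference ("clipping"). Matching is on exact surface tokens only.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical sentences, not taken from any evaluation set.
reference = "the cat sat on the mat"
synonym = "the cat sat on the rug"      # acceptable synonym
reordered = "on the mat the cat sat"    # acceptable word order

for name, sentence in [("synonym", synonym), ("reordered", reordered)]:
    for n in (1, 2):
        score = clipped_precision(sentence, reference, n)
        print(f"{name:9s} {n}-gram precision: {score:.3f}")
```

With these toy sentences the synonym loses credit at both the unigram and the bigram level, and the reordered sentence keeps a perfect unigram score but loses a bigram, even though a human reader would accept both.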
I'd add another reason: BLEU compares lemmas, and inflections (like morphological case in European languages) frequently play a paramount role in conveying the meaning (for example, the accusative case marks a direct object, and the dative case can be translated as an indirect object or a complement; all this is ignored by BLEU). As mentioned above, statistical MTs are not very adept at grammar.
It might be worth adding that BLEU seeks correlation on a corpus level. Statistical MTs learn from corpora as well. Given the limited number of high-quality corpora, I don't think it would be surprising if both NIST's BLEU evaluator and the system being evaluated learned from the same corpus. So if both harvesters made the same mistake (quite possible), and a rule-based system was correct, guess who will be penalized.
What about NIST, do they acknowledge the problem somehow?
Yes, they do:
(look for Performance Measurement).
It is not as if BLEU were the only metric in the field. I personally consider METEOR much more promising, especially because it (finally!) takes synonyms into consideration. And, of course, the work never stops, and there are numerous attempts to improve the correlation between human perception and the evaluation score.
So why was BLEU still used in the 2006 evaluation? I guess the central reason is that BLEU was the first one, and it takes time to change the procedures. And it might not be so easy to kick IBM's stuff out.