12 Critical Flaws of BLEU
Why you shouldn’t trust BLEU according to 37 studies published over 20 years
BLEU is an extremely popular evaluation metric for AI.
It was originally proposed 20 years ago for machine translation evaluation, but it is nowadays commonly used in many natural language processing (NLP) tasks. BLEU has also been recently used to evaluate large language models, such as OpenAI GPT-3, Google PaLM, and Amazon AI AlexaTM.
BLEU is widely used, but is it a good metric?
In this article, I investigate 20 years of research publications criticizing BLEU and exposing its main flaws.
I reference 37 scientific papers published from 2003 to 2022 that demonstrate the critical limits of BLEU, illustrate common misuses by the scientific community, and provide compelling reasons for its retirement.
In particular, we will see why evaluations based only on BLEU are mostly meaningless and should not be trusted.
I first focus on machine translation, since BLEU was proposed for this task. Later in the article, I also examine the validity of BLEU in tasks outside machine translation.
Note: All the claims, statements, and suggestions in this article come from independent studies (except for one study that I co-authored). They were all peer-reviewed by experts and are cited in related scientific work.
Before diving into this article, if you are unfamiliar with BLEU, I recommend reading my previous article presenting the metric.
BLEU poorly correlates with human judgments
In 2002, when the BLEU paper was published (Papineni et al., 2002), it presented results showing a strong correlation with human evaluations. In other words, if we ask humans to rank translations according to their quality, BLEU should yield a similar ranking.
Nonetheless, only one year later, at MT Summit 2003, machine translation experts published the first studies showing that BLEU does correlate with human evaluations, but only in particular settings (Akiba et al., 2003).
Since then, thanks to improved methodologies for evaluating metrics and the growing availability of human judgment datasets, a succession of scientific studies has demonstrated again and again that BLEU correlates poorly with human judgments.
BLEU and humans often disagree on ranking translations according to their quality.
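The studies cited in this section follow roughly the same methodology: score a set of MT systems with BLEU, collect human judgments for the same outputs, and compute a correlation coefficient between the two. Here is a minimal sketch of that methodology with invented scores (the system scores are mine, purely for illustration):

```python
# A minimal sketch of how metric-human correlation is typically measured.
# The scores below are invented for illustration only.
from scipy.stats import kendalltau, pearsonr

# One BLEU score and one averaged human quality score per MT system.
bleu_scores  = [24.1, 27.5, 28.2, 30.0, 31.3]   # hypothetical
human_scores = [63.0, 71.2, 68.5, 70.1, 74.8]   # hypothetical (e.g., 0-100 adequacy)

tau, _ = kendalltau(bleu_scores, human_scores)
r, _ = pearsonr(bleu_scores, human_scores)
print(f"Kendall tau: {tau:.2f}, Pearson r: {r:.2f}")
# If BLEU ranked systems the way humans do, tau would be close to 1.
# The studies cited in this section repeatedly find much lower values.
```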
One of the most cited works questioning the correlation of BLEU with human judgments is the work by Callison-Burch et al. (2006): “Re-evaluating the Role of BLEU in Machine Translation Research.” They wrote that BLEU is:
“[…] not sufficient to reflect a genuine improvement in translation quality, and in other circumstances that it is not necessary to improve Bleu in order to achieve a noticeable improvement in translation quality”
Their study shows that improving BLEU is not enough to claim an improvement in translation quality. From this observation, they even called for a "re-examination of past work which failed to show improvements in Bleu."
This call was largely ignored: BLEU only became more and more popular in machine translation research.
At the same conference (EACL 2006), Belz and Reiter (2006) also questioned BLEU and found that it does not correlate well with the judgments of human experts.
In 2007, even though BLEU had been shown to be unreliable, it was already "the de facto standard in machine translation evaluation" (Callison-Burch et al., 2007).
Tan et al. (2015) wrote about a "schism between BLEU and manual evaluation" highlighting "the poor correlation between MT systems with low BLEU scores and high manual evaluation scores." They found that the translations of their system, despite large BLEU improvements, were actually judged worse than the baseline translations by humans.
Smith et al. (2016) demonstrated that translations with a very high BLEU score can be of very poor quality, or worse, unintelligible. Let’s have a look at their examples of translations from German to English:
Example 1
SRC: in diesem sinne untergraben diese maßnahmen teilweise das demokratische system der usa .
REF: in this sense , the measures will partially undermine the american democratic system .
MT: in this sense , undermine these measures in the democratic system of the united states .
BLEU-OPT: the democratic system in this sense , the measures partially undermine the american .
Example 2
SRC: am wichtigsten ist es aber , mit seinem arzt zu sprechen , um zu bestimmen , ob er durchgefuhrt werden sollte oder nicht .
REF: but the important thing is to have a discussion with your doctor to determine whether or not to take it .
MT: the most important thing is , however , with his doctor to speak , in order to determine whether it should be carried out or not .
BLEU-OPT: the important thing is to have a doctor performed but , with to take it . talking to determine whether or not to s
In these examples, “SRC” is the source German sentence to translate. “REF” is the ground truth reference translation used to compute BLEU. “MT” is the translation generated by a regular machine translation system. “BLEU-OPT” is a system optimized to generate higher BLEU scores by directly exploiting “REF.”
The translations by “BLEU-OPT” have higher BLEU scores than the regular translations generated by “MT”, yet they are clearly worse.
The same year, Arle Lommel published "BLUES for BLEU," another study of the correlation between BLEU and human judgments. He pointed out that BLEU is often used to draw conclusions in machine translation research papers even though it is an unreliable measure of MT quality:
“BLEU is simply a measure of string similarity to a particular reference, it is not evaluating ‘quality’ in any sense that really corresponds to human understanding (even if we see some correlation in judgments)”
“BLUES for BLEU” called “into question the significance of a considerable body of MT research that relies on the use of small differences in BLEU (or other reference-based) scores to demonstrate system improvement or to compare systems”
Kann et al. (2018) tried to use BLEU to evaluate fluency, but then wrote: "Its correlation with human scores was so low that we do not consider it in our final experiments."
In 2020, Mathur et al. (2020) received an outstanding paper award from the ACL for their work calling for “retiring BLEU as the de facto standard metric.”
They showed that some of the strong correlations observed in previous work between BLEU and human judgments were partly due to outlier systems in the metric evaluation datasets that are very easy for BLEU to discriminate. They wrote that the "resulting high values of correlation can then lead to false confidence in the reliability of metrics." Given this observation, they suggest retiring BLEU from machine translation evaluation and stress that claiming a system is better based on a small BLEU difference carries very little meaning.
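The outlier effect they describe is easy to reproduce with toy numbers: in the sketch below, five hypothetical systems of similar quality show only a weak BLEU-human correlation, but adding a single clearly worse outlier system pushes the Pearson correlation close to 1. All numbers are invented for illustration.

```python
# Toy illustration (invented numbers) of how one outlier system can
# inflate the Pearson correlation between BLEU and human scores.
from scipy.stats import pearsonr

human = [0.70, 0.72, 0.71, 0.69, 0.73]   # hypothetical human quality scores
bleu  = [30.1, 29.5, 30.4, 28.9, 29.8]   # hypothetical BLEU scores

r_close, _ = pearsonr(human, bleu)
print(f"Similar systems only:   r = {r_close:.2f}")   # weak correlation

# Add one obviously bad system that both BLEU and humans rank last.
human_with_outlier = human + [0.30]
bleu_with_outlier  = bleu + [10.0]

r_outlier, _ = pearsonr(human_with_outlier, bleu_with_outlier)
print(f"With an outlier system: r = {r_outlier:.2f}")  # close to 1
```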
Freitag et al. (2022) optimized their systems to improve BLEURT, another metric, and obtained significantly better translation quality according to humans. In their experiments, this improvement in quality came with a large drop of 8.9 BLEU points.
This long line of independent studies, published between 2003 and 2022, shows clearly, and from many different angles, that BLEU correlates poorly with human evaluations in many settings.
Note: This is not an exhaustive survey. Many other studies published have demonstrated a poor correlation between BLEU and human evaluations. For instance, the WMT conference yearly publishes the findings of a metric evaluation shared task studying the correlation of BLEU with human judgments. I didn’t list all of them to focus on the points that I believe are the most important and convincing.
BLEU is blind to most of the valid translations
BLEU only recognizes as valid the translations that were selected as references. This is probably its most critical limitation.
Zhou et al. (2006) describe it as follows: “lack of support for paraphrase matching and the absence of recall scoring. BLEU-esque procedures recognize equivalence only when two n-grams exhibit the same surface-level representations, i.e. the same lexical identities.”
The reference translations exploited by BLEU are only a very small sample of all the valid translations.
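A toy example makes this concrete: a perfectly valid paraphrase that shares few n-grams with the single reference scores very low, while a less fluent output that copies the reference wording scores high. The sentences below are mine and purely illustrative; the exact scores depend on the sacrebleu version and its default smoothing.

```python
# Illustration (with made-up sentences) of BLEU's reliance on surface
# n-gram overlap with the reference.
import sacrebleu

reference = "The cat quickly ran across the busy street."

# A valid paraphrase that uses different words than the reference.
paraphrase = "The feline dashed over the crowded road in a hurry."
# A clumsy output that happens to reuse most of the reference wording.
near_copy = "The cat quickly ran across the busy street street."

for name, hyp in [("paraphrase", paraphrase), ("near copy", near_copy)]:
    score = sacrebleu.sentence_bleu(hyp, [reference]).score
    print(f"{name:>10}: BLEU = {score:.1f}")
# BLEU rewards the near copy and heavily penalizes the paraphrase,
# even though a human would likely prefer the paraphrase.
```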
In the years following the introduction of BLEU, many other metrics were proposed to fix this issue. The most popular is probably METEOR (Banerjee and Lavie, 2005), which also rewards translations that do not exactly match the references but use synonyms or even paraphrases of them. METEOR has often been used alongside BLEU but never replaced it.
BLEU poorly exploits multiple references
When BLEU was proposed, it was mainly intended for use cases where multiple reference translations are available. By exploiting more references, BLEU can reward more of the valid translations. Intuitively, using more references should also increase the correlation with human judgments. In practice, several studies have shown that this is not always the case.
Zhou et al. (2006) observed that, when running BLEU for the NIST 2003 Chinese machine translation evaluation, they obtained a better correlation with human judgments when using one reference translation instead of the four references available. Turian et al. (2003) reported similar observations on the same dataset but for the Arabic language.
Almost 15 years later, Freitag et al. (2020) showed that the nature of the reference translations matters more than their number. For instance, using references that read like original text rather than literal translations improves the correlation with human judgments more than adding several standard references, which usually exhibit poor diversity and translation errors.
They also observed that “multireference BLEU does not improve the correlation for high-quality output.”
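As a side note, here is what multi-reference scoring looks like in practice with sacrebleu: each additional reference is passed as another parallel list, and an n-gram can match any of the references. The sentences below are invented for illustration, and the exact scores depend on the sacrebleu version.

```python
# Minimal multi-reference BLEU computation with sacrebleu
# (invented example sentences).
import sacrebleu

hypotheses = ["The measures partially undermine the democratic system ."]

# Each inner list is one complete set of references, aligned with the hypotheses.
refs_single = [
    ["These measures partly undermine the democratic system ."],
]
refs_multi = refs_single + [
    ["In this sense , the measures will partially undermine the democratic system ."],
]

print("1 reference :", sacrebleu.corpus_bleu(hypotheses, refs_single).score)
print("2 references:", sacrebleu.corpus_bleu(hypotheses, refs_multi).score)
# Additional references can only match more n-grams, so the score typically
# stays the same or increases, but as the studies above show, this does not
# necessarily improve the correlation with human judgments.
```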
BLEU is unsuitable for evaluating short texts
BLEU was proposed to evaluate the translation quality of entire documents. It was never meant to evaluate translations at a finer grain. Yet BLEU is still often used at the sentence level.
Several works have demonstrated that the use of BLEU for sentence evaluation is unsuitable. For instance, Song et al. (2013) found that BLEU is “unreliable at the sentence or sub-sentence levels.”
According to Fomicheva and Specia (2019), BLEU harshly penalizes any differences between the machine translation output and the reference when it is used at the segment level.
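To see how harsh this can be, here is a small hypothetical example: a translation that differs from the reference by a single, perfectly acceptable word loses most of its higher-order n-gram matches. The sentences are invented; the exact value depends on sacrebleu's smoothing settings.

```python
# Hypothetical example of BLEU's harshness at the sentence level.
import sacrebleu

reference = "He arrived at the station late ."

# One perfectly acceptable word insertion ("train station" vs. "station").
hypothesis = "He arrived at the train station late ."

score = sacrebleu.sentence_bleu(hypothesis, [reference]).score
print(f"Sentence BLEU: {score:.1f}")
# A single inserted word breaks several 3-gram and 4-gram matches, so the
# score drops sharply (to roughly half of a perfect match here), far more
# than the tiny quality difference warrants.
```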
BLEU is sensitive to tokenization
The dependence on tokenization is a very common flaw among evaluation metrics for NLP.
BLEU is computed on sequences of tokens. Consequently, changing the tokenization of the translation to evaluate will change the BLEU score. This behavior is often confusing or underestimated, even by machine translation experts. It prevents the comparison of BLEU scores computed with different tokenizations (Post, 2018; Nekvinda and Dušek, 2021).
Tokenization is effectively a parameter of the metric: using a different tokenization should be considered as using a different metric. Moreover, the correlation between BLEU and human judgments changes depending on the tokenization.
For illustration, Post (2018) shows how much the BLEU score of the same output can change when only the tokenization differs.
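You can observe this effect yourself with sacrebleu, which exposes the tokenizer as an explicit parameter: the same hypothesis and reference pair receives different scores under different tokenization schemes. The example sentence below is invented, and the exact gaps depend on the text and the sacrebleu version.

```python
# Same hypothesis, same reference: only the tokenizer changes
# (invented example sentence).
import sacrebleu

hyps = ["The U.S. economy grew by 2.3% in the third quarter, officials said."]
refs = [["The US economy grew by 2.3 % in the third quarter, officials said."]]

for tok in ["13a", "intl", "char", "none"]:
    score = sacrebleu.corpus_bleu(hyps, refs, tokenize=tok).score
    print(f"tokenize={tok:>4}: BLEU = {score:.1f}")
# Reporting a bare BLEU number without the tokenization (or a sacrebleu
# signature) makes scores from different papers incomparable.
```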
BLEU is unsuitable for evaluating translations of high quality
Babych and Hartley (2008) showed that BLEU rewards improvements less and less readily as translation quality goes up.
This is a much more significant issue today, since machine translation systems now reach a translation quality close to that of humans. Further improvements in translation quality are likely to go unrewarded by BLEU.
BLEU lacks discriminative power
This flaw is related to the main use of BLEU by the research community, i.e., ranking systems.
BLEU may correlate very well with human evaluation when there is a clear gap in translation quality between systems. However, when the translation quality is similar, BLEU has been reported to be unable to find which system is better (Mathur et al., 2020).
Callison-Burch et al. (2006) also show that a given BLEU score can be obtained from a large number of different variations of a translation. In their example, two candidate translations receive similar BLEU scores, but humans judged the second one to be better.
Among all the variations that yield the same BLEU score, many will actually lead to better or worse translations according to human evaluations.
BLEU lacks interpretability
What is a good BLEU score?
For instance, if a system obtains 60 BLEU points, does it mean that it is a good system?
Among machine translation researchers, we often hear discussions about systems being good or bad based only on their absolute BLEU scores. For instance, 60 BLEU points is rarely seen in the machine translation research literature, so we may be tempted to say that a system scoring 60 BLEU, or above, is very good.
We shouldn’t.
As we saw, many parameters have a significant impact on the computation of BLEU scores. For instance, it is well known that using more reference translations or smaller tokens can dramatically increase a BLEU score.
Hamon and Mostefa (2008) state that absolute BLEU scores can’t be interpreted:
“The absolute variation of the BLEU scores cannot be interpreted as a difference in MT system quality. It rather shows that comparing systems is the only plausible thing with BLEU, instead of evaluating systems in an absolute way.”
Reiter and Belz (2009) also conclude that BLEU does “not provide a useful measure of content quality.”
BLEU overestimates adequacy
As early as 2004, Babych and Hartley (2004) demonstrated that a major issue with BLEU is its lack of weighting for important words. BLEU rewards all tokens equally: words that are critical to understanding the translation are not valued more than punctuation marks.
A translation may be highly ranked by BLEU despite having all the keywords incorrectly translated.
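Here is a small invented example of this equal weighting: a translation that gets the single most important word wrong, reversing the meaning, still keeps almost all of its n-gram matches and a high BLEU score (exact values depend on the sacrebleu version and smoothing).

```python
# Hypothetical example: BLEU weights a critical content word
# no more than any other token.
import sacrebleu

reference = "The committee approved the new safety regulations on Tuesday ."

# The single most important word is wrong: the meaning is reversed.
wrong_keyword = "The committee rejected the new safety regulations on Tuesday ."

score = sacrebleu.sentence_bleu(wrong_keyword, [reference]).score
print(f"BLEU with the key word mistranslated: {score:.1f}")
# Almost all n-grams still match, so BLEU stays high even though the
# translation now says the opposite of the reference.
```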
Later, Koehn and Monz (2006) found that "BLEU is very sensitive to jargon, to selecting exactly the right words, and not synonyms that human judges may appreciate as equally good."
BLEU exhibits technological biases
Akiba et al. (2003) found that BLEU significantly correlates with human judgments only when the machine translation systems are of similar types.
Callison-Burch et al. (2006) wrote that the use of BLEU is valid only to evaluate systems using “similar translation strategies.” Koehn and Monz (2006) found that rule-based systems and statistical systems are not equally appreciated by BLEU.
Popović et al. (2016) also show that if the reference translation used by BLEU was produced by post-editing a machine translation output, BLEU is biased towards the system that generated that output.
Edunov et al. (2020) demonstrate that BLEU is also biased by the nature of the training data used by neural machine translation systems.
As stated by Reiter (2018):
“BLEU has technological biases that we do not understand. This is especially worrying as new technologies such as neural networks become more prominent; we do not know if BLEU is ‘fair’ to such technologies.”
BLEU is tied to poor scientific credibility
BLEU is only a metric that measures the surface distance between two sequences of tokens. It was never meant to replace human evaluation or to validate scientific hypotheses.
Yet, in a study of the evaluations in 769 machine translation papers (Marie et al., 2021), I showed that most of them (74%) rely exclusively on comparisons between BLEU scores to draw conclusions. Note: This is the only study in which I was involved that I reference in this blog article.
Comparing BLEU scores to decide whether a system is better than another one has always been widely accepted by the research community, and still is today.
Even if BLEU correlated perfectly with human judgments, a significant amount of machine translation research would still have low scientific credibility, mainly because of comparisons between BLEU scores that were computed differently, for instance with different tokenizations (Post, 2018). Such BLEU scores are not comparable.
I identified that more than one-third of machine translation research papers potentially compare incomparable BLEU scores (Marie et al., 2021).
More recently, I identified that this incorrect use of BLEU also affects the evaluation of recent large models, such as Whisper, GPT-3, and No Language Left Behind.
Copying BLEU scores from previous work is also de facto associated with an absence of statistical significance testing, despite its necessity (Kocmi et al., 2021).
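Running such a test is not difficult when the outputs of both systems are available. Below is a simplified sketch of a paired bootstrap resampling test built on top of sacrebleu; it illustrates the general idea rather than the exact procedure used by Kocmi et al. (2021), and recent versions of sacrebleu also provide paired significance tests directly from the command line.

```python
# A simplified paired bootstrap resampling test for BLEU
# (illustrative sketch; sys_a, sys_b, and refs are placeholders).
import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=12345):
    """Estimate how often system A beats system B on resampled test sets."""
    rng = random.Random(seed)
    ids = list(range(len(refs)))
    wins_a = 0
    for _ in range(n_samples):
        sample = [rng.choice(ids) for _ in ids]   # resample with replacement
        hyp_a = [sys_a[i] for i in sample]
        hyp_b = [sys_b[i] for i in sample]
        ref = [[refs[i] for i in sample]]         # single reference set
        bleu_a = sacrebleu.corpus_bleu(hyp_a, ref).score
        bleu_b = sacrebleu.corpus_bleu(hyp_b, ref).score
        if bleu_a > bleu_b:
            wins_a += 1
    return wins_a / n_samples

# Example usage with placeholder data:
# p_better = paired_bootstrap(outputs_a, outputs_b, references)
# print(f"System A wins on {p_better:.1%} of resampled test sets")
```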
These issues are not inherent to BLEU itself; they are the result of a very long history of misuse that the scientific community has come to accept.
BLEU is unsuitable outside of machine translation evaluation
BLEU is a machine translation metric. Nonetheless, it is also often used in other NLP tasks, usually without prior work validating its use for those tasks.
There is actually a significant amount of work that invalidates BLEU, or at least demonstrates a poor correlation with human judgments, in tasks other than machine translation.
Paraphrase generation
Wubben et al. (2010) found only a medium overall correlation between BLEU and human evaluations for paraphrase generation. They do not recommend using BLEU for this task.
Dialogue response generation
BLEU is among the most popular evaluation metrics used for dialogue response generation.
However, in dialogue, the space of valid responses is even larger than in translation (Nekvinda and Dušek, 2021). Constraining the validity of responses to a set of given references makes little sense.
Galley et al. (2015) show that even with many references, the correlation between BLEU and human evaluations is very weak. Liu et al. (2016) reached similar conclusions.
According to Nekvinda and Dušek (2021), using BLEU to assess the fluency of responses should be avoided.
Text simplification
As shown by Sulem et al. (2018), there is no correlation between BLEU and grammaticality/meaning preservation in sentence simplification where sentence splitting is involved. They generalize by stating that BLEU often “negatively correlates with simplicity, essentially penalizing simpler sentences.”
This statement was later nuanced by Alva-Manchego et al. (2021), but they still discourage the use of BLEU for text simplification: “BLEU is a bad metric to estimate Structural Simplicity in system outputs where sentence splitting was performed.”
Grammatical error correction (GEC)
BLEU negatively correlates with human rankings in GEC tasks as demonstrated by Grundkiewicz et al. (2015), Napoles et al. (2015), and Napoles et al. (2019).
Automatic summarization
A good summary should be coherent, consistent, fluent, and relevant, but BLEU poorly correlates with all these characteristics as shown by Fabbri et al. (2021).
Image caption generation
Elliott and Keller (2014) argue that unigram BLEU should not be used for image caption evaluation due to its weak correlation with human accuracy judgments.
Natural language generation (NLG)
All the tasks above belong to the field of NLG, including machine translation.
Novikova et al. (2017) go even further by showing that the assumptions made by BLEU, such as that human-generated references are correct and complete, are invalid for NLG in general.
Scientific hypothesis testing
Reiter (2018) invalidates the use of BLEU outside of MT, for evaluation of individual texts, and for scientific hypothesis testing.
“the evidence does not support using BLEU to evaluate other types of NLP systems (outside of MT), and it does not support using BLEU to evaluate individual texts rather than NLP systems.”
Conclusion
To sum up, BLEU is not a good metric mainly because of 12 critical flaws:
BLEU poorly correlates with human judgments
BLEU is blind to most of the valid translations
BLEU poorly exploits multiple references
BLEU is unsuitable for evaluating short texts
BLEU is sensitive to tokenization
BLEU is unsuitable for evaluating translations of high quality
BLEU lacks interpretability
BLEU overestimates adequacy
BLEU lacks discriminative power
BLEU exhibits technological biases
BLEU is tied to poor scientific credibility
BLEU is unsuitable outside of machine translation evaluation
For this conclusion, I kept two citations that point out the negative impact BLEU has had on research itself.
“the fact that Bleu’s correlation with human judgments has been drawn into question may warrant a re-examination of past work which failed to show improvements in Bleu”
This was written in 2006 (Callison-Burch et al., 2006) and published by renowned machine translation researchers. This call didn’t have any noticeable impact.
15 years later, in 2021, a research team from Microsoft published (Kocmi et al., 2021):
“we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions.”
and
”the common use of BLEU over the past years has possibly negatively affected research decisions.”
When I worked on this article, what struck me the most is that there is only a tiny number of studies in favor of BLEU. As early as 2003, researchers were already convinced that the metric has serious limits that make it unsuitable for many tasks.
In 2022, BLEU is still by far the most used metric in machine translation, and it has also become popular in the evaluation of large language models.
Is there no alternative to BLEU?
In my study (Marie et al., 2021), I counted more than 100 other metrics proposed since 2010 that could serve as alternatives to BLEU. Most of them have never been widely adopted and were quickly forgotten because of BLEU's dominant position.
If you are evaluating AI models for language generation, and more particularly machine translation, have a look at BLEURT and COMET. They also have flaws, but at least they correlate much better with human judgments than BLEU.
References
Yasuhiro Akiba, Eiichiro Sumita, Hiromi Nakaiwa, Seiichi Yamamoto, and Hiroshi G. Okuno. 2003. Experimental comparison of MT evaluation methods: RED vs. BLEU. In Proceedings of Machine Translation Summit IX: Papers, New Orleans, USA.
Fernando Alva-Manchego, Carolina Scarton, and Lucia Specia. 2021. The (un)suitability of automatic evaluation metrics for text simplification. Computational Linguistics, 47(4):861–889.
Bogdan Babych and Anthony F. Hartley. 2008. Sensitivity of automated MT evaluation metrics on higher quality MT output: BLEU vs task-based evaluation methods. In Proceedings of LREC 2008.
Bogdan Babych and Tony Hartley. 2004. Extending the BLEU MT evaluation method with frequency weightings. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 621–628, Barcelona, Spain.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–320, Trento, Italy. Association for Computational Linguistics.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158, Prague, Czech Republic. Association for Computational Linguistics.
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of Bleu in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 249–256, Trento, Italy. Association for Computational Linguistics.
Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 452–457, Baltimore, Maryland. Association for Computational Linguistics.
Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
Marina Fomicheva and Lucia Specia. 2019. Taking MT evaluation metrics to extremes: Beyond correlation with human judgments. Computational Linguistics, 45(3):515–558.
Markus Freitag, David Grangier, and Isaac Caswell. 2020. BLEU might be guilty but references are not innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 61–71, Online. Association for Computational Linguistics.
Markus Freitag, David Grangier, Qijun Tan, and Bowen Liang. 2022. High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics. Transactions of the Association for Computational Linguistics, 10:811–825.
Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 445–450, Beijing, China. Association for Computational Linguistics.
Roman Grundkiewicz, Marcin Junczys-Dowmunt, and Edward Gillian. 2015. Human evaluation of grammatical error correction systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 461–470, Lisbon, Portugal. Association for Computational Linguistics.
Olivier Hamon and Djamel Mostefa. 2008. The impact of reference quality on automatic MT evaluation. In Coling 2008: Companion volume: Posters, pages 39–42, Manchester, UK. Coling 2008 Organizing Committee.
Katharina Kann, Sascha Rothe, and Katja Filippova. 2018. Sentence-level fluency evaluation: References help, but can be spared! In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 313–323, Brussels, Belgium. Association for Computational Linguistics.
Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In Proceedings of the Sixth Conference on Machine Translation, pages 478–494, Online. Association for Computational Linguistics.
Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings of the Workshop on Statistical Machine Translation, pages 102–121, New York City. Association for Computational Linguistics.
Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
Arle Lommel. 2016. BLUES for BLEU: Reconsidering the validity of reference-based MT evaluation. In Proceedings of the LREC 2016 Workshop "Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem".
Benjamin Marie, Atsushi Fujita, and Raphael Rubino. 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7297–7306, Online. Association for Computational Linguistics.
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics.
Courtney Napoles, Maria Nadejde, and Joel Tetreault. 2019. Enabling robust grammatical error correction in new domains: Data sets, metrics, and analyses. Transactions of the Association for Computational Linguistics, 7:551–566.
Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593, Beijing, China. Association for Computational Linguistics.
Tomáš Nekvinda and Ondřej Dušek. 2021. Shades of BLEU, flavours of success: The case of MultiWOZ. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 34–46, Online. Association for Computational Linguistics.
Maja Popović, Mihael Arcan, and Arle Lommel. 2016. Potential and limits of using post-edits as reference translations for MT evaluation. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 218–229.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393–401.
Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
Aaron Smith, Christian Hardmeier, and Joerg Tiedemann. 2016. Climbing mont BLEU: The strange world of reachable high-BLEU translations. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 269–281.
Xingyi Song, Trevor Cohn, and Lucia Specia. 2013. BLEU deconstructed: Designing a better MT evaluation metric.
Elior Sulem, Omri Abend, and Ari Rappoport. 2018. BLEU is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 738–744, Brussels, Belgium. Association for Computational Linguistics.
Liling Tan, Jon Dehdari, and Josef van Genabith. 2015. An awkward disparity between BLEU / RIBES scores and human judgements in machine translation. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 74–81, Kyoto, Japan. Workshop on Asian Translation.
Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of machine translation and its evaluation. In Proceedings of Machine Translation Summit IX: Papers, New Orleans, USA.
Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2010. Paraphrase generation as monolingual translation: Data and evaluation. In Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics.
Liang Zhou, Chin-Yew Lin, and Eduard Hovy. 2006. Re-evaluating machine translation results with paraphrase support. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 77–84, Sydney, Australia. Association for Computational Linguistics.