Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Nitika Mathur, Timothy Baldwin, Trevor Cohn


Resources and Evaluation Long Paper

Session 9A: Jul 7 (17:00-18:00 GMT)
Session 10A: Jul 7 (20:00-21:00 GMT)
Abstract: Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

Similar Papers

On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald
Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation
Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, Kewei Tu
Multi-Hypothesis Machine Translation Evaluation
Marina Fomicheva, Lucia Specia, Francisco Guzmán
Fact-based Content Weighting for Evaluating Abstractive Summarisation
Xinnuo Xu, Ondřej Dušek, Jingyi Li, Verena Rieser, Ioannis Konstas