Abstract
The goal of translation, be it by human or by machine, is, given some text in a source language, to produce text in a target language that simultaneously 1) preserves the meaning of the source text and 2) achieves natural expression in the target language. However, researchers in the machine translation community usually assess translations using a single score intended to capture semantic accuracy and the naturalness of the output simultaneously. In this paper, we build on recent advances in information theory to mathematically prove and empirically demonstrate that such single-score summaries do not and cannot give the complete picture of a system's true performance. Concretely, we prove that a tradeoff exists between accuracy and naturalness and demonstrate it by evaluating the submissions to the WMT24 shared task. Our findings help explain well-known empirical phenomena, such as the observation that optimizing translation systems for a specific accuracy metric (like BLEU) initially improves the system's naturalness, while ``overfitting'' the system to the metric can significantly degrade its naturalness. Thus, we advocate for a change in how translations are evaluated: rather than comparing systems using a single number, they should be compared on an accuracy-naturalness plane.