A Set of Recommendations for Assessing Human-Machine Parity in Language Translation

Abstract

The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese to English news translation, showing that the finding of human-machine parity was owed to weaknesses in the evaluation design, which is currently considered best practice in the field. We show that the professional human translations contained significantly fewer errors, and that perceived quality in human evaluation depends on the choice of raters, the availability of linguistic context, and the creation of reference translations. Our results call for revisiting current best practices to assess strong machine translation systems in general and human-machine parity in particular, for which we offer a set of recommendations based on our empirical findings.
