Abstract
Automatic evaluation of translation remains a challenging task owing to the orthographic, morphological, syntactic and semantic richness and divergence observed across languages. String-based metrics such as BLEU have previously been used extensively for automatic evaluation tasks, but their limitations are now increasingly recognized. Although learned neural metrics have helped mitigate some of the limitations of string-based approaches, they remain constrained by a paucity of gold evaluation data in most languages beyond the usual high-resource pairs. In this work we address some of these gaps. We create a large human evaluation ratings dataset for 13 Indian languages covering 21 translation directions and then train a neural translation evaluation metric named Cross-lingual Optimized Metric for Translation Assessment of Indian Languages (COMTAIL) on this dataset. The best performing metric variants show significant performance gains over the previous state of the art when judging translation pairs with at least one Indian language. Furthermore, we conduct a series of ablation studies to highlight the sensitivities of such a metric to changes in domain, translation quality, and language groupings. We release both the COMTAIL dataset and the accompanying metric models.