Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Abstract

Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose calibrating LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
