Abstract
Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.