CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Abstract

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.
