Abstract
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.