CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

  • 2025-01-23 18:59:47
  • Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat

Abstract

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.

 
