Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

  • 2024-06-17 18:55:38
  • Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang

Abstract

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
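
The abstract does not state the training objective, so the following is only a minimal sketch of how a length penalty might enter a DPO-style loss for a single preference pair. It assumes the reward being optimized is penalized by alpha times the response length in tokens; under that assumption the Bradley-Terry derivation adds alpha * (len_w - len_l) to the usual DPO margin. The function name `lr_dpo_loss`, its arguments, and the default values of `beta` and `alpha` are hypothetical and are not taken from the paper.

```python
import math


def lr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                len_w, len_l, beta=0.1, alpha=0.01):
    """Length-regularized DPO loss for one (chosen, rejected) pair.

    Assumed form (not confirmed by the abstract):
        margin = beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]
                 + alpha * (len_w - len_l)
        loss   = -log(sigmoid(margin))

    logp_* are summed token log-probabilities under the policy,
    ref_logp_* under the frozen reference model, and len_* are
    response lengths in tokens.
    """
    # Standard DPO margin: difference of implicit rewards.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Length term (assumption): with alpha = 0 this reduces to plain DPO.
    margin += alpha * (len_w - len_l)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under this assumed form, a pair whose chosen response is much longer than the rejected one already has a large margin, so it contributes a smaller loss and gradient than an otherwise identical pair with a shorter chosen response. That is one way the optimizer can be discouraged from learning "longer is better". In the iterative setting described above, such a loss would be applied in each round to pairs freshly labeled by the reward model.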

 
