Abstract
Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.