Abstract
Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed \textit{Self-Play Preference Optimization} (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by a symmetric pairwise loss such as Direct Preference Optimization (DPO) or Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
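To make the game-theoretic framing concrete, the following toy sketch runs self-play on a constant-sum preference game over three candidate responses. It is not the SPPO objective, which the abstract does not spell out. It uses a generic multiplicative-weights (exponential) update as a stand-in for "iterative policy updates", and the preference matrix `P`, the step size `eta`, the starting policy, and all function names are assumptions chosen for illustration. The preferences are deliberately cyclic (0 beats 1, 1 beats 2, 2 beats 0), a pattern that a single scalar reward per response, as in the Bradley-Terry model, cannot represent.

```python
import math

# Hypothetical preference matrix: P[i][j] is the probability that
# response i is preferred over response j. The game is constant-sum
# because P[i][j] + P[j][i] = 1 for every pair.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def win_rate_vs_policy(P, policy):
    """Win probability of each pure response against a mixed policy."""
    n = len(policy)
    return [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]


def self_play_step(P, policy, eta):
    """One multiplicative-weights update against the current policy itself."""
    wins = win_rate_vs_policy(P, policy)
    weights = [p * math.exp(eta * w) for p, w in zip(policy, wins)]
    total = sum(weights)
    return [w / total for w in weights]


def run_self_play(P, init, eta=0.1, steps=2000):
    """Iterate self-play; return the last policy and the running average."""
    policy = list(init)
    avg = [0.0] * len(init)
    for t in range(1, steps + 1):
        # Running mean of the policies played so far.
        avg = [a + (p - a) / t for a, p in zip(avg, policy)]
        policy = self_play_step(P, policy, eta)
    return policy, avg


init = [0.8, 0.1, 0.1]  # assumed, deliberately unbalanced start
last, avg = run_self_play(P, init)

# Exploitability: how far the best pure response exceeds the 0.5 game value.
print("start   :", max(win_rate_vs_policy(P, init)) - 0.5)
print("averaged:", max(win_rate_vs_policy(P, avg)) - 0.5)
```

In this toy setting the quantity to watch is exploitability, i.e. how much better than 0.5 the best response can do against a given policy. The averaged policy is reported because, for this generic update with a fixed step size, the standard no-regret guarantee applies to the average of the iterates rather than to the last one.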