Beyond RLHF: A Unified Theoretical Framework of Alignment

Abstract

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for the RLHF objective itself, and they do not allow the guarantees of different methods to be compared, because those methods are often analyzed under different frameworks. Toward a unified framework for alignment, we ask under what assumptions we can derive existing or new training objectives and obtain theoretical guarantees. To this end, we reframe alignment as distribution learning from pairwise preferences, a reframing that makes a probabilistic assumption describing how preferences reveal information about the target LM. This leads us to propose three principled alignment objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We prove that all three enjoy strong non-asymptotic $O(1/n)$ convergence to the target LM, naturally avoiding degeneracy. In particular, the reverse KL objective closely resembles the RLHF objective, providing strong justification for RLHF. Furthermore, our theory explains, for the first time, the empirical finding that on-policy objectives (e.g., RLHF) typically outperform likelihood-style objectives (e.g., DPO). Finally, empirical results indicate that the proposed objectives are competitive with strong baselines across several tasks and models.
