A Minimaximalist Approach to Reinforcement Learning from Human Feedback

Abstract

We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training, and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve these qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model-based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
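The practical recipe described above (sample several trajectories, compare them pairwise, use each trajectory's proportion of wins as its reward) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `win_rate_rewards`, the `prefer` callback, and the rock-paper-scissors preference are hypothetical names chosen for illustration, and the policy-update step that would consume these rewards is omitted.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def win_rate_rewards(
    trajectories: Sequence[T],
    prefer: Callable[[T, T], float],
) -> List[float]:
    """Assign each trajectory the fraction of pairwise comparisons it wins.

    `prefer(a, b)` is an assumed interface: it returns the probability
    (or a 0/1 outcome) that trajectory `a` is preferred over `b`.
    """
    n = len(trajectories)
    if n < 2:
        raise ValueError("need at least two trajectories to compare")
    rewards = []
    for i, a in enumerate(trajectories):
        # Compare trajectory i against every other sample from the same policy.
        wins = sum(prefer(a, b) for j, b in enumerate(trajectories) if j != i)
        rewards.append(wins / (n - 1))
    return rewards


# Toy intransitive preference (hypothetical): rock-paper-scissors.
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}


def rps_prefer(a: str, b: str) -> float:
    if a == b:
        return 0.5  # ties count as half a win
    return 1.0 if (a, b) in BEATS else 0.0


cyclic_rewards = win_rate_rewards(["rock", "paper", "scissors"], rps_prefer)
skewed_rewards = win_rate_rewards(["rock", "rock", "paper"], rps_prefer)
```

In the cyclic batch no sample dominates, so every trajectory receives the same win rate; in the skewed batch the `"paper"` sample wins all of its comparisons and receives the highest reward. In a full training loop these per-trajectory rewards would presumably be passed to a standard policy-optimization step.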
