Greedy Sampling Is Provably Efficient for RLHF

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
