Abstract
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
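
As a rough illustration of the link between reward variance and a flat objective, the sketch below uses a toy setting that is an assumption of this example, not the setup analyzed in the paper: a softmax policy over four candidate outputs, with the expected reward as the objective and no KL regularization. The names `sharp_rewards` and `flat_rewards` are hypothetical reward models that rank the outputs identically, so they would score the same on a ranking-based accuracy measure, yet they induce very different reward variance under the policy.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(probs, rewards):
    return sum(p * r for p, r in zip(probs, rewards))

def reward_variance(probs, rewards):
    """Variance of the reward when outputs are sampled from the policy."""
    mean = expected_reward(probs, rewards)
    return sum(p * (r - mean) ** 2 for p, r in zip(probs, rewards))

def gradient_norm(probs, rewards):
    """Euclidean norm of the gradient of the expected reward w.r.t. the logits.

    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    """
    mean = expected_reward(probs, rewards)
    grad = [p * (r - mean) for p, r in zip(probs, rewards)]
    return math.sqrt(sum(g * g for g in grad))

def ranking(rewards):
    """Indices of the outputs sorted from lowest to highest reward."""
    return sorted(range(len(rewards)), key=lambda i: rewards[i])

# Toy policy: uniform over four candidate outputs (assumed for illustration).
probs = softmax([0.0, 0.0, 0.0, 0.0])

# Two hypothetical reward models with the same ranking of outputs.
sharp_rewards = [0.0, 0.1, 0.2, 1.0]     # well-separated rewards
flat_rewards = [0.50, 0.51, 0.52, 0.53]  # same order, compressed range

for name, rewards in [("sharp", sharp_rewards), ("flat", flat_rewards)]:
    print(name,
          "variance:", reward_variance(probs, rewards),
          "gradient norm:", gradient_norm(probs, rewards))
```

In this toy case the compressed reward model yields both a much smaller variance and a much smaller gradient norm, even though its ranking of outputs is unchanged. It is meant only to build intuition for the claim above, not to reproduce the paper's theorem or experiments.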