Abstract
Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.