GRAM: A Generative Foundation Reward Model for Reward Generalization

Abstract

In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
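
The link between label smoothing and a regularized pairwise ranking loss can be illustrated with a small numerical sketch. It is not taken from the paper and its derivation may differ in detail. It assumes the generative reward model chooses between two label tokens, "A" and "B", through a two-way softmax over hypothetical logits z_a and z_b, with response A preferred and smoothing weight eps. Under that assumption, the label-smoothed cross-entropy equals the Bradley-Terry pairwise loss -log sigmoid(z_a - z_b) plus a penalty eps * (z_a - z_b) on the margin. All function and parameter names below are illustrative.

```python
import math


def log_sigmoid(d: float) -> float:
    """Numerically stable log(sigmoid(d))."""
    if d >= 0:
        return -math.log1p(math.exp(-d))
    return d - math.log1p(math.exp(d))


def label_smoothed_ce(z_a: float, z_b: float, eps: float) -> float:
    """Label-smoothed cross-entropy when response A is preferred.

    Assumes a two-way softmax over the label tokens, so that
    p(A) = sigmoid(z_a - z_b) and p(B) = sigmoid(z_b - z_a).
    """
    d = z_a - z_b
    log_p_a = log_sigmoid(d)
    log_p_b = log_sigmoid(-d)
    return -(1.0 - eps) * log_p_a - eps * log_p_b


def regularized_pairwise_loss(z_a: float, z_b: float, eps: float) -> float:
    """Bradley-Terry pairwise ranking loss plus a margin penalty eps * d."""
    d = z_a - z_b
    return -log_sigmoid(d) + eps * d


if __name__ == "__main__":
    # The two objectives agree for arbitrary (hypothetical) logits.
    for z_a, z_b in [(2.0, -1.0), (0.3, 0.7), (-4.0, 5.0)]:
        lhs = label_smoothed_ce(z_a, z_b, eps=0.1)
        rhs = regularized_pairwise_loss(z_a, z_b, eps=0.1)
        print(f"z_a={z_a:+.1f} z_b={z_b:+.1f} smoothed={lhs:.6f} regularized={rhs:.6f}")
```

The identity follows from log sigmoid(-d) = log sigmoid(d) - d. The extra term eps * d grows with the margin, so under these assumptions label smoothing discourages the model from driving the preferred logit arbitrarily far above the other.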
