Reward Reasoning Model

Abstract

Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.
