Generative Verifiers: Reward Modeling as Next-Token Prediction

Abstract

Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We demonstrate that GenRM outperforms discriminative, DPO verifiers, and LLM-as-a-Judge, resulting in a 16-40% improvement in the number of problems solved with Best-of-N on algorithmic and math reasoning tasks. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems. Finally, we demonstrate that generative verifiers scale favorably with model size and inference-time compute.
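As a rough illustration of the selection procedure described above, the following sketch shows Best-of-N selection with a verifier score, plus one simple way to aggregate several sampled verification verdicts. It is not the authors' implementation: the names `best_of_n`, `vote_score`, and `sample_verdict` are hypothetical, the verifier is a stand-in callable rather than an LLM, and scoring by the fraction of "correct" verdicts is an assumed simplification of the majority-voting idea.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=score)


def vote_score(sample_verdict: Callable[[str], bool], solution: str, k: int) -> float:
    """Score a solution as the fraction of k sampled verdicts that say 'correct'.

    `sample_verdict` is a hypothetical stand-in for sampling one
    chain-of-thought verification from a generative verifier and
    reading off its final yes/no answer.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    votes = sum(1 for _ in range(k) if sample_verdict(solution))
    return votes / k


if __name__ == "__main__":
    # Toy verifier (assumption): accepts only the solution ending in "= 4".
    def toy_verdict(solution: str) -> bool:
        return solution.endswith("= 4")

    solutions = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
    chosen = best_of_n(solutions, lambda s: vote_score(toy_verdict, s, k=8))
    print(chosen)
```

With a real generative verifier the verdicts would be stochastic, so a larger `k` spends more test-time compute to obtain a less noisy score for each candidate.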
