Process Reinforcement through Implicit Rewards

Abstract

Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
