Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Abstract

Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches using min-form credit assignment achieve reasoning performance comparable to verifiable reward-based methods within only 30% of the steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. Code and models are available at https://github.com/CJReinforce/PURE.
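
To make the contrast between the two credit assignment forms concrete, the following is a minimal sketch based only on the definitions stated in the abstract: sum-form takes the value at a step to be the gamma-decayed sum of future process rewards, while min-form takes it to be the minimum of future process rewards. The function names, the example rewards, and the choice to include the current step's reward in "future rewards" are assumptions made for illustration; this is not the PURE implementation, which may differ in details.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Sum-form: return at step t is the gamma-decayed sum of rewards from t on."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(rewards: List[float]) -> List[float]:
    """Min-form: return at step t is the minimum reward from t on.

    Assumption: the current step's reward counts as part of the future.
    """
    returns = [0.0] * len(rewards)
    running = float("inf")
    for t in range(len(rewards) - 1, -1, -1):
        running = min(running, rewards[t])
        returns[t] = running
    return returns


# Hypothetical per-step process rewards for a 4-step solution
# in which the third step is judged to be wrong.
step_rewards = [0.5, 0.75, -0.5, 0.25]

sum_returns = sum_form_returns(step_rewards)  # [1.0, 0.5, -0.25, 0.25]
min_returns = min_form_returns(step_rewards)  # [-0.5, -0.5, -0.5, 0.25]
```

In this toy case the sum-form return for the first step is positive even though a later step is wrong, and it keeps growing as more high-reward steps are added. The min-form return stays within the range of a single step's reward and is pinned to the worst upcoming step, which matches the abstract's point about limiting the value function range.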
