Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning

Abstract

Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning capabilities of large language models (LLMs) by guiding their step-by-step reasoning toward a final answer. However, existing PRMs either treat each reasoning step in isolation, failing to capture inter-step dependencies, or struggle to align process rewards with the final outcome. Consequently, the reward signal fails to respect temporal causality in sequential reasoning and faces ambiguous credit assignment. These limitations make downstream models vulnerable to reward hacking and lead to suboptimal performance. In this work, we propose Conditional Reward Modeling (CRM) that frames LLM reasoning as a temporal process leading to a correct answer. The reward of each reasoning step is not only conditioned on the preceding steps but also explicitly linked to the final outcome of the reasoning trajectory. By enforcing conditional probability rules, our design captures the causal relationships among reasoning steps, with the link to the outcome allowing precise attribution of each intermediate step, thereby resolving credit assignment ambiguity. Further, through this consistent probabilistic modeling, the rewards produced by CRM enable more reliable cross-sample comparison. Experiments across Best-of-N sampling, beam search and reinforcement learning demonstrate that CRM consistently outperforms existing reward models, offering a principled framework for enhancing LLM reasoning. In particular, CRM is more robust to reward hacking and delivers stable downstream improvements without relying on verifiable rewards derived from ground truth.
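
To make the conditional-probability idea concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes a hypothetical reward model that outputs, for each step, the probability that the step is sound given that all earlier steps were sound. Chaining these values with the product rule gives a running score that can only stay flat or decrease, and the final score can be compared across samples, for example in Best-of-N selection. The function names and the example numbers are illustrative assumptions.

```python
import math
from typing import List, Sequence


def trajectory_scores(step_probs: Sequence[float]) -> List[float]:
    """Running product of per-step conditional probabilities.

    step_probs[t] is an ASSUMED model output: the probability that step t
    is sound, given that every earlier step was sound.
    """
    scores: List[float] = []
    log_p = 0.0  # accumulate in log space for numerical stability
    for p in step_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("each conditional probability must lie in (0, 1]")
        log_p += math.log(p)
        scores.append(math.exp(log_p))
    return scores


def final_score(step_probs: Sequence[float]) -> float:
    """Chained score of the whole trajectory (1.0 for an empty trajectory)."""
    scores = trajectory_scores(step_probs)
    return scores[-1] if scores else 1.0


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate trajectory with the highest chained score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    finals = [final_score(c) for c in candidates]
    return max(range(len(finals)), key=finals.__getitem__)


# Hypothetical per-step probabilities for three sampled solutions.
samples = [
    [0.9, 0.8, 0.5],
    [0.95, 0.9, 0.9],
    [0.99, 0.2, 0.99],
]
chosen = best_of_n(samples)
```

In this sketch a single weak step (the 0.2 in the third sample) lowers every later score in that trajectory, which is one simple way a step-level signal can stay tied to the outcome of the whole chain.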
