CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies and inefficient learning. Methods like PPO provide credit assignment through value estimation, but often yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can judge each reasoning step, but they require high-quality process supervision labels and are time-consuming when applied in online reinforcement learning (RL). To overcome these limitations, we introduce a simple but efficient method, Credit Assignment Policy Optimization (CAPO). Given a reasoning response rollout from the policy model, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in one pass, thereby providing verifiable token-level rewards to refine the tokens that were originally assigned identical rule-based rewards. This enables more fine-grained credit assignment in an effective way. Furthermore, to enhance the accuracy and robustness of CAPO, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments with Llama and Qwen backbones of different sizes show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across six challenging mathematical benchmarks and three out-of-domain benchmarks.
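
The credit-assignment idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the majority-vote rule, the flat `penalty` value, and the representation of a critique as a set of flagged step indices are all assumptions made for illustration. Each token starts with the shared rule-based outcome reward, and tokens in steps that most critiques flag as wrong have that reward reduced.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def vote_wrong_steps(critiques: Sequence[Set[int]], num_steps: int) -> Set[int]:
    """Hypothetical voting rule: flag a step if more than half of the critiques flag it."""
    counts = Counter(step for critique in critiques for step in critique)
    threshold = len(critiques) / 2
    return {s for s in range(num_steps) if counts[s] > threshold}


def token_rewards(
    step_spans: Sequence[Tuple[int, int]],  # [start, end) token range of each step
    outcome_reward: float,                  # rule-based reward shared by all tokens
    critiques: Sequence[Set[int]],          # each critique: indices of steps judged wrong
    penalty: float = 0.5,                   # assumed value, not taken from the paper
) -> List[float]:
    """Refine a uniform outcome reward into per-token rewards."""
    wrong = vote_wrong_steps(critiques, len(step_spans))
    num_tokens = step_spans[-1][1] if step_spans else 0
    rewards = [outcome_reward] * num_tokens
    for idx in wrong:
        start, end = step_spans[idx]
        for t in range(start, end):
            rewards[t] = outcome_reward - penalty
    return rewards


# Three steps covering six tokens; three critiques from the (assumed) GenPRM.
spans = [(0, 2), (2, 5), (5, 6)]
print(token_rewards(spans, 1.0, [{1}, {1, 2}, set()]))
```

In this example only step 1 is flagged by a majority of critiques, so only its three tokens receive the reduced reward. Sampling more critiques makes the vote less sensitive to a single erroneous judgment, which is the intuition behind scaling the voting with the number of critiques.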
