Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Abstract

Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism leverages verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.
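
The per-iteration loop described in the abstract (generate solutions, critique them on-policy, and reward both with the same outcome verifier) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Trajectory` type, the `collect_batch` helper, the exact-match `outcome_verifier`, the verification prompt wording, and the toy policies are assumptions made for illustration. The policy update itself is omitted, since the abstract does not specify the RL algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    kind: str      # "generation" or "verification"
    prompt: str
    response: str
    reward: float


def outcome_verifier(answer: str, reference: str) -> float:
    """Assumed rule-based verifier: 1.0 on exact match with the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def collect_batch(
    problem: str,
    reference: str,
    generate: Callable[[str, int], str],
    verify: Callable[[str], float],
    num_samples: int = 4,
) -> List[Trajectory]:
    """Collect generation and self-verification trajectories for one problem."""
    batch: List[Trajectory] = []
    for i in range(num_samples):
        # 1. The policy proposes a solution; the outcome verifier scores it.
        solution = generate(problem, i)
        gen_reward = outcome_verifier(solution, reference)
        batch.append(Trajectory("generation", problem, solution, gen_reward))

        # 2. The same policy critiques its own on-policy solution.
        #    The prompt wording is an assumption, not the paper's template.
        ver_prompt = f"Problem: {problem}\nProposed answer: {solution}\nIs it correct?"
        predicted = verify(ver_prompt)

        # 3. The critique is rewarded when it agrees with the outcome verifier.
        ver_reward = 1.0 if predicted == gen_reward else 0.0
        batch.append(Trajectory("verification", ver_prompt, str(predicted), ver_reward))
    return batch


# Toy stand-ins for the policy (hypothetical): alternate right and wrong answers,
# and always claim that the proposed answer is correct.
def toy_generate(problem: str, sample_index: int) -> str:
    return "4" if sample_index % 2 == 0 else "5"


def toy_verify(prompt: str) -> float:
    return 1.0


batch = collect_batch("2 + 2 = ?", "4", toy_generate, toy_verify, num_samples=2)
for t in batch:
    print(t.kind, t.reward)
```

In this sketch both trajectory types land in one batch, so a single policy update would see rewards for solving and for verifying. The over-confident toy verifier earns no reward on the wrong answer, which is the kind of signal that would discourage superficial self-checks.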
