Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
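
To make the inference-time loop concrete, the following is a minimal sketch of iterative generation-verification turns with tool feedback. It is an illustration under stated assumptions, not the ReVeal implementation: the names `refine`, `toy_generate`, and `toy_execute` are hypothetical, the policy is replaced by a toy stand-in that fixes its code once it sees feedback, and the "tool" is simply Python's built-in `exec`. TAPO and the RL training procedure are not modeled here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; none of these names come from the paper.
# Generator: (task, feedback history) -> (candidate code, self-constructed tests)
Generator = Callable[[str, List[str]], Tuple[str, str]]
# Executor: (code, tests) -> (did all tests pass?, feedback text)
Executor = Callable[[str, str], Tuple[bool, str]]


def refine(task: str, generate: Generator, execute: Executor,
           max_turns: int = 3) -> Tuple[Optional[str], int]:
    """Alternate generation and verification until the tests pass or turns run out."""
    history: List[str] = []
    code: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code, tests = generate(task, history)    # generation turn
        passed, feedback = execute(code, tests)  # verification via tool execution
        if passed:
            return code, turn
        history.append(feedback)                 # feedback conditions the next turn
    return code, max_turns


def toy_generate(task: str, history: List[str]) -> Tuple[str, str]:
    # Toy stand-in for a policy: emits a buggy solution first,
    # then a corrected one once any feedback is available.
    body = "return a + b" if history else "return a - b"
    code = f"def add(a, b):\n    {body}\n"
    tests = "assert add(2, 3) == 5"
    return code, tests


def toy_execute(code: str, tests: str) -> Tuple[bool, str]:
    # Toy stand-in for a sandboxed tool: runs code and tests in a fresh namespace.
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all tests passed"


final_code, turns_used = refine("add two numbers", toy_generate, toy_execute)
```

In this sketch the loop stops as soon as the self-constructed tests pass, so its reliability depends entirely on the quality of those tests, which is the reason the abstract emphasizes optimizing self-verification explicitly. A real system would also run untrusted code in an isolated sandbox rather than calling `exec` in-process.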