Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
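
As a rough illustration of the two curiosity signals named above, the following minimal sketch computes a response perplexity from per-token log-probabilities and the variance across the value estimates of several critic heads. The function names, the additive combination with the verifiable reward, and the weights `alpha` and `beta` are assumptions made for illustration only; the abstract does not specify how the bonuses are scaled, clipped, or combined.

```python
import math
from statistics import pvariance


def perplexity(token_logprobs):
    """Perplexity of a response: exp of the mean negative token log-probability."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def head_variance(head_values):
    """Population variance of the value estimates from the critic heads."""
    return pvariance(head_values)


def shaped_reward(verifiable_reward, token_logprobs, head_values,
                  alpha=0.1, beta=0.1):
    """Verifiable reward plus two curiosity bonuses.

    alpha and beta are hypothetical weights, and the plain additive form
    is an assumption, not the formulation used in the paper.
    """
    actor_bonus = alpha * perplexity(token_logprobs)
    critic_bonus = beta * head_variance(head_values)
    return verifiable_reward + actor_bonus + critic_bonus
```

For example, a response whose tokens each had probability 0.5 has perplexity 2, and two critic heads predicting 0.0 and 2.0 have variance 1, so under this sketch higher actor uncertainty and higher disagreement among heads both raise the shaped reward.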