CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

  • 2025-09-11 17:59:17
  • Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
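
The two curiosity signals named in the abstract can be illustrated with a minimal sketch. It computes perplexity from per-token log-probabilities (the actor signal) and the variance of value estimates across critic heads (the critic signal). Everything beyond those two definitions is an assumption made for illustration and is not taken from the paper: the function names, the coefficients actor_coef and critic_coef, the cap bonus_cap, and the choice to add the capped bonus to the verifiable reward.

```python
import math
from statistics import pvariance


def perplexity(token_logprobs):
    """Perplexity of a response from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def critic_disagreement(head_values):
    """Population variance of the value estimates produced by the critic heads."""
    return pvariance(head_values)


def shaped_reward(task_reward, token_logprobs, head_values,
                  actor_coef=0.01, critic_coef=0.01, bonus_cap=1.0):
    """Hypothetical combination: verifiable reward plus a capped curiosity bonus.

    The additive form, both coefficients, and the cap are illustrative
    assumptions, not the formulation used in the paper.
    """
    bonus = (actor_coef * perplexity(token_logprobs)
             + critic_coef * critic_disagreement(head_values))
    return task_reward + min(bonus, bonus_cap)


if __name__ == "__main__":
    # A two-token response where each token had probability 0.5,
    # scored by two critic heads that disagree (0.0 vs 2.0).
    logprobs = [math.log(0.5), math.log(0.5)]
    print(shaped_reward(1.0, logprobs, [0.0, 2.0]))
```

In this sketch a response the actor finds surprising (high perplexity), or one on which the critic heads disagree (high variance), receives a larger bonus, which is the intuition the abstract describes; the scaling and clipping actually used by the authors may differ.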

 
