Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs

Abstract

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods, which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LLM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.
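
The abstract does not state the exact form of the entropy-based term, so the following is only an illustrative sketch of what "augmenting the advantage function with an entropy-based term" could look like. The function names, the scaling coefficient `alpha`, and the clipping factor `kappa` are all assumptions introduced here for illustration, as is the choice to cap the bonus at a fraction of the advantage magnitude so that it cannot flip the advantage's sign. In an autodiff framework, the entropy would presumably be treated as a constant (no gradient), which is what separates this kind of advantage shaping from a maximum-entropy regularizer; that is also an assumption.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shape_advantages(advantages, entropies, alpha=0.4, kappa=2.0):
    """Add a clipped entropy bonus to each per-token advantage.

    alpha and kappa are hypothetical hyperparameters, not values from the
    paper. The bonus alpha * H is capped at |A| / kappa, so with kappa > 1
    the shaped advantage keeps the sign of the original advantage.
    """
    shaped = []
    for a, h in zip(advantages, entropies):
        bonus = min(alpha * h, abs(a) / kappa)
        shaped.append(a + bonus)  # the "one line" that modifies the advantage
    return shaped


# Usage: a high-entropy (uniform) position versus a low-entropy (peaked) one.
h_uniform = token_entropy([0.0, 0.0, 0.0, 0.0])
h_peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
shaped = shape_advantages([1.0, 1.0], [h_uniform, h_peaked])
```

Under these assumptions, tokens generated at high-entropy positions receive a larger advantage than tokens at confident positions within the same response, while a zero advantage stays zero because the cap is proportional to its magnitude.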
