Abstract
Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
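As a rough illustration of the two quantities named above, the sketch below shows one way an entropy-based term could be added to per-token advantages, together with the commonly used unbiased Pass@K estimate. The abstract does not give the exact form of the entropy term, so the additive form `A_t + alpha * H_t`, the coefficient `alpha`, and all function names here are assumptions made for illustration, not the method itself. In an autodiff framework the entropy would presumably be treated as a constant so that it only reshapes the advantage; that is also an assumption.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shape_advantages(advantages, entropies, alpha=0.1):
    """Add an entropy-based bonus to each per-token advantage.

    Hypothetical form: A'_t = A_t + alpha * H_t. Both the additive form
    and `alpha` are assumptions; the source only says that the advantage
    is augmented with an entropy-based term.
    """
    return [a + alpha * h for a, h in zip(advantages, entropies)]


def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Usage: a confident token (low entropy) and an uncertain one (high entropy).
entropies = [token_entropy([1.0, 0.0]), token_entropy([0.5, 0.5])]
shaped = shape_advantages([1.0, 1.0], entropies, alpha=0.5)
```

In this toy example the two tokens start with the same advantage, and only the high-entropy token receives a bonus, which is the intuition behind rewarding exploratory reasoning steps rather than uncertainty everywhere.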