Abstract
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a vast number of RL runs without entropy intervention: the policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a * e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is gained at the expense of policy entropy and is thus bottlenecked by its exhaustion, and that the ceiling is fully predictable: at H = 0, R = -a + b. Our finding necessitates entropy management for continuous exploration when scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to the action's advantage when using Policy Gradient-like algorithms. Our empirical study shows that the values of the covariance term and the entropy differences match exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, which further explains why policy entropy decreases monotonically. Having understood the mechanism behind entropy dynamics, we propose to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariances. Experiments show that these methods encourage exploration, thus helping the policy escape entropy collapse and achieve better downstream performance.
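As a minimal illustration of the empirical law R = -a * e^H + b, the sketch below fits a and b by ordinary least squares and reads off the predicted ceiling at H = 0. The function names, the coefficient values, and the synthetic noise-free data are all assumptions made for illustration; they are not the paper's code or measurements.

```python
import numpy as np


def fit_entropy_law(entropy, reward):
    """Fit R = -a * exp(H) + b by least squares and return (a, b).

    Hypothetical helper: R is linear in x = exp(H), so a degree-1
    polynomial fit of R against x gives slope = -a and intercept = b.
    """
    x = np.exp(np.asarray(entropy, dtype=float))
    r = np.asarray(reward, dtype=float)
    slope, intercept = np.polyfit(x, r, deg=1)
    return -slope, intercept


def predicted_ceiling(a, b):
    """Performance predicted at H = 0, where exp(H) = 1, i.e. R = -a + b."""
    return -a + b


# Synthetic, noise-free points generated from made-up coefficients
# (assumed values, chosen only to exercise the fit).
true_a, true_b = 0.2, 0.8
H = np.linspace(1.5, 0.1, 20)          # entropy decreasing over training
R = -true_a * np.exp(H) + true_b       # performance implied by the law

a_hat, b_hat = fit_entropy_law(H, R)
ceiling = predicted_ceiling(a_hat, b_hat)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, ceiling at H = 0: {ceiling:.3f}")
```

With real training logs, H and R would be the measured policy entropy and validation performance at each checkpoint. The fit then extrapolates the performance that remains available before entropy is exhausted.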