Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning

Abstract

Recently, reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). A core challenge in RLVR involves managing the exchange between the entropy and the performance of policies. Despite the importance of this exchange, a fine-grained understanding of when and how it operates most effectively remains limited. To bridge this gap, we conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Specifically, we first divide the training process into two distinct stages based on entropy dynamics, i.e., a rising stage and a plateau stage, and then systematically investigate how this mechanism varies across stage-level, instance-level, and token-level granularities. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns, which in turn drives rapid performance gains. Moreover, in the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences. Motivated by these findings, we propose two methods that dynamically adjust the reward signal using perplexity and positional information to focus RL updates on tokens that exhibit high learning potential, achieving improvements compared to the baseline methods on various LLMs.
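
The abstract relies on two standard quantities: the entropy of the policy's next-token distribution at each position, and the perplexity of a sampled response. As a minimal sketch of their textbook definitions only (this is not the authors' reward-adjustment method, and the function names and toy logits below are hypothetical), they can be computed as follows:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vocabulary-sized list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    # Shannon entropy (in nats) of the next-token distribution at one position.
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sequence_perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-probability of the sampled tokens).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example (assumed values): a 4-token vocabulary and a 2-token response.
step_logits = [[2.0, 0.0, 0.0, 0.0],   # peaked distribution: lower entropy
               [0.0, 0.0, 0.0, 0.0]]   # uniform distribution: maximal entropy
sampled_tokens = [0, 2]

entropies = [token_entropy(l) for l in step_logits]
logprobs = [math.log(softmax(l)[t]) for l, t in zip(step_logits, sampled_tokens)]
perplexity = sequence_perplexity(logprobs)
```

In this toy case the second position has the maximal entropy for a 4-token vocabulary (ln 4), so it would count as a high-entropy token, while the perplexity summarizes how likely the whole sampled response was under the policy.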
