GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Abstract

Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but it is limited by coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses it with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token for fine-grained credit assignment; and 2) Sequence-Level Group Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show that our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
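To make the contrast between the two granularities concrete, the following is a minimal sketch of entropy-weighted reward shaping. The abstract does not give the exact weighting formulas, so everything here is an assumption for illustration: the additive bonus form, the coefficient `alpha`, the restriction of the bonus to responses with positive reward, and the function names `token_entropy`, `token_level_rewards`, and `sequence_level_reward` are all hypothetical and should not be read as the paper's definitions of GTPO or GRPO-S.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_level_rewards(seq_reward, entropies, alpha=1.0):
    """Token-level shaping (GTPO-style idea, assumed form).

    Each token of a successful response gets the sequence reward plus a
    bonus proportional to its share of the total entropy. Unsuccessful
    responses (reward <= 0) are left unshaped in this sketch.
    """
    total = sum(entropies)
    if seq_reward <= 0 or total == 0.0:
        return [seq_reward] * len(entropies)
    return [seq_reward + alpha * h / total for h in entropies]


def sequence_level_reward(seq_reward, entropies, alpha=1.0):
    """Sequence-level shaping (GRPO-S-style idea, assumed form).

    The whole sequence receives one reward, increased by a bonus
    proportional to its average token entropy.
    """
    if seq_reward <= 0 or not entropies:
        return seq_reward
    return seq_reward + alpha * sum(entropies) / len(entropies)


# Toy example: a correct two-token response where the second token was
# sampled from a much less certain distribution than the first.
entropies = [1.0, 3.0]
print(token_level_rewards(1.0, entropies))        # [1.25, 1.75]
print(sequence_level_reward(1.0, entropies, 0.5))  # 2.0
```

In this toy setup the token-level variant gives the higher-entropy token a larger reward than its neighbor, while the sequence-level variant produces a single scalar per response, which is the distinction the abstract draws between per-token and per-sequence credit assignment.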
