Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily from fixed clipping bounds on token-level probability ratios and from the standardization of identical rewards, both of which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level utilization of generated responses. DCPO achieves state-of-the-art performance on four benchmarks across four different models. In particular, on the AIME24 benchmark with the Qwen2.5-Math-7B model, DCPO achieves an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32-fold sampling, surpassing DAPO (36.7/31.6), GRPO (36.7/32.1), and GSPO (40.0/34.9). On the AIME25 benchmark with Qwen2.5-14B, DCPO achieves 23.3/19.0, surpassing GRPO (13.3/10.5), DAPO (20.0/15.3), and GSPO (16.7/9.9). Furthermore, DCPO improves the nonzero advantage by 28% on average over GRPO across the four models, doubles the training efficiency relative to DAPO, and reduces the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, all while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
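
As an illustration of the zero-gradient issue described above, the following minimal sketch shows how standardizing rewards within a single group of responses yields all-zero advantages when every response receives the same reward. It is not the implementation used in this work: the function name, the use of the population standard deviation, and the `eps` constant are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_standardized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Hypothetical helper for illustration only; the population standard
    deviation and the eps value are assumptions, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every response in the group is correct: all rewards are identical.
all_correct = group_standardized_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: some responses correct, some incorrect.
mixed = group_standardized_advantages([1.0, 0.0, 0.0, 1.0])

print(all_correct)  # every advantage is zero, so the group adds no gradient
print(mixed)        # correct responses get positive, incorrect get negative
```

In the first group every advantage is zero, so those responses contribute nothing to the policy update even though they were generated and scored; only the mixed group carries a learning signal.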