Abstract
Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.