Abstract
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9\% while improving accuracy by 2.3\%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
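As a rough illustration of the two-stage idea, the Python sketch below summarizes the lengths of correct solutions for one problem and then places that summary in the reasoning context as guidance. It is a minimal sketch under stated assumptions, not the implementation described in this work: the function names, the choice of the median as the summary statistic, and the wording of the guidance sentence are all hypothetical, and the reinforcement learning updates themselves are omitted.

```python
from statistics import median

def target_length(samples):
    """Summarize the lengths of successful solutions for one problem.

    `samples` is a list of (token_count, is_correct) pairs.
    Using the median is an assumption made for this sketch.
    """
    correct = [n for n, ok in samples if ok]
    if not correct:
        return None  # no successful solution, so no length signal
    return int(median(correct))

def with_length_guidance(prompt, budget):
    """Embed the length estimate in the reasoning context.

    The guidance sentence below is hypothetical wording.
    """
    if budget is None:
        return prompt
    return f"{prompt}\nI will answer the question with about {budget} tokens."

# Stage 1 (sketch): collect lengths of sampled solutions, keep the correct ones.
samples = [(120, True), (480, False), (150, True), (200, True)]
budget = target_length(samples)

# Stage 2 (sketch): condition later rollouts on the discovered length.
guided_prompt = with_length_guidance("Solve: 12 * 7 = ?", budget)
```

Because the guidance in this sketch lives in the context rather than in a hard truncation rule, the model remains free to exceed or undercut the stated length at inference time.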