Abstract
Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such global length penalties often lead to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model's output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, step-level generalized advantage estimation (S-GAE), and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.