Token-Efficient RL for LLM Reasoning

Abstract

We propose reinforcement learning (RL) strategies tailored for reasoning in large language models (LLMs) under strict memory and compute limits, with a particular focus on compatibility with LoRA fine-tuning. Building on early policy gradient methods with baseline subtraction, we design critic-free methods that operate on a small, informative subset of output tokens to reduce memory usage and stabilize training. We introduce S-GRPO, a stochastic variant of Group Relative Policy Optimization, and T-SPMO, a token-level prefix matching approach for fine-grained credit assignment. Applied to Qwen2-1.5B, our methods raise accuracy on the SVAMP benchmark from 46% to over 70% and show strong performance on multi-digit multiplication. Surprisingly, full-token GRPO under LoRA fails to improve over the base model, suggesting that selective token-level optimization may act as an implicit regularizer in low-parameter training regimes.
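To make the idea of a critic-free update over a token subset concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a group-relative baseline (each completion's reward normalized against the mean and standard deviation of its group) and a uniform random choice of which token positions contribute to the loss. The abstract does not specify the sampling rule or the exact objective, so the function names, the `keep_prob` parameter, and the Bernoulli masking are all hypothetical.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Critic-free baseline: normalize each reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sample_token_mask(num_tokens, keep_prob, rng):
    """Pick which token positions enter the loss (assumed uniform sampling)."""
    return [rng.random() < keep_prob for _ in range(num_tokens)]


def subset_policy_gradient_loss(token_logprobs, rewards, keep_prob=0.25, seed=0):
    """Average -advantage * log-prob over the sampled tokens only.

    token_logprobs: one list of per-token log-probabilities per completion.
    rewards: one scalar reward per completion in the same group.
    """
    rng = random.Random(seed)
    advantages = group_relative_advantages(rewards)
    total, count = 0.0, 0
    for logprobs, adv in zip(token_logprobs, advantages):
        mask = sample_token_mask(len(logprobs), keep_prob, rng)
        for lp, keep in zip(logprobs, mask):
            if keep:
                total += -adv * lp
                count += 1
    # If no token was sampled, there is nothing to optimize for this group.
    return total / count if count else 0.0
```

In a real training loop the log-probabilities would be differentiable tensors, and only the sampled positions would need to keep activations for the backward pass, which is where the memory saving would be expected to come from under these assumptions.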
