Abstract
We propose reinforcement learning (RL) strategies tailored for reasoning in large language models (LLMs) under strict memory and compute limits, with a particular focus on compatibility with LoRA fine-tuning. Building on early policy gradient methods with baseline subtraction, we design critic-free methods that operate on a small, informative subset of output tokens to reduce memory usage and stabilize training. We introduce S-GRPO, a stochastic variant of Group Relative Policy Optimization, and T-SPMO, a token-level prefix matching approach for fine-grained credit assignment. Applied to Qwen2-1.5B, our methods raise accuracy on the SVAMP benchmark from 46% to over 70% and show strong performance on multi-digit multiplication. Surprisingly, full-token GRPO under LoRA fails to improve over the base model, suggesting that selective token-level optimization may act as an implicit regularizer in low-parameter training regimes.