Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Abstract

Large reasoning models (LRMs) can perform complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic (Liu et al., 2024) to tune a large target LM for long reasoning using a substantially smaller model as a guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider models, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve relative improvements in pass@1 of 26% and 29%, respectively, across four mathematical datasets using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, yielding a 13% relative improvement in pass@1 over the Qwen2.5-32B base model. Our work presents a computationally efficient method to elicit long reasoning in large models with minimal or no additional training.
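
The abstract does not spell out the decoding rule, so the following is only a minimal sketch of logit arithmetic in the style of Liu et al. (2024), not the paper's implementation. It assumes the guider's contribution is its logit offset from an untuned counterpart of the same size; that counterpart (`guider_base`), the weight `alpha`, the function names, and the toy numbers are all hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(target, guider, guider_base, alpha=1.0):
    """Shift the target's next-token logits by the guider's offset.

    Assumed form (hypothetical, proxy-tuning style):
        combined = target + alpha * (guider - guider_base)
    All three vectors must be defined over the same vocabulary.
    """
    if not (len(target) == len(guider) == len(guider_base)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [t + alpha * (g - b) for t, g, b in zip(target, guider, guider_base)]


# Toy 4-token vocabulary; the numbers are made up for illustration.
target = [2.0, 1.0, 0.5, 0.0]       # large target model
guider = [0.5, 0.2, 2.0, 0.0]       # small reasoning-tuned guider
guider_base = [0.5, 0.2, 0.0, 0.0]  # assumed untuned small counterpart

combined = combine_logits(target, guider, guider_base, alpha=1.0)
probs = softmax(combined)
# Greedy choice of the next token under the combined distribution.
next_token = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the target alone would pick token 0, while the combined logits favor token 2, the token the guider upweights relative to its untuned counterpart. Setting `alpha=0.0` recovers the target model's own logits.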
