Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems

Abstract

Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time in language and multimodal systems. RINS is a particular form of recursive depth that significantly outperforms over 55 other variants, including the recent "repeat-all-over" (RAO) strategy in Mobile LLM (Liu et al., 2024) and latent recurrent thinking (Geiping et al., 2025). Unlike prior work, we carry out our comparisons in a compute-matched regime. We demonstrate that, for a fixed model size and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. More importantly, with lightweight (linear) adapters (comprising <1% of model parameters) and stochastic dropout, RINS offers a no-regret strategy. This means that RINS-enabled pretraining improves language modeling performance even when recursive depth is not applied at inference time. The gain therefore holds in a regime matched for training compute, parameter count, and inference cost, which suggests that RINS is a viable component of LLM pretraining.
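
The abstract describes recursive depth only at a high level. The following is a minimal, framework-free sketch of the general idea of reusing a block's parameters several times at inference. It assumes a model given as a list of layer functions. Three choices are illustrative assumptions, not the paper's implementation: a prefix block A that is reapplied r times before the remaining block B (signature A^r B), the per-step `adapters` scales, and `sample_depth`. All helper names are hypothetical. `repeat_all_over` mimics the RAO baseline named above only in spirit, by rerunning the whole stack.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run(layers: Sequence[Layer], x: Vector) -> Vector:
    """Apply a stack of layers once, in order."""
    for layer in layers:
        x = layer(x)
    return x


def recursive_forward(
    layers: Sequence[Layer],
    x: Vector,
    split: int,
    r: int,
    adapters: Optional[Sequence[Vector]] = None,
) -> Vector:
    """Hypothetical recursive depth with shared weights: apply block
    A = layers[:split] r times, then block B = layers[split:] once.

    `adapters` (hypothetical) holds one elementwise scale vector per
    boundary between passes through A, standing in for a lightweight
    linear adapter.
    """
    block_a, block_b = layers[:split], layers[split:]
    for step in range(r):
        x = run(block_a, x)  # same parameters on every pass
        if adapters is not None and step < r - 1:
            # Lightweight linear map between consecutive passes through A.
            x = [s * v for s, v in zip(adapters[step], x)]
    return run(block_b, x)


def repeat_all_over(layers: Sequence[Layer], x: Vector, r: int) -> Vector:
    """Baseline for comparison: rerun the entire stack r times."""
    for _ in range(r):
        x = run(layers, x)
    return x


def layer_calls_recursive(n_layers: int, split: int, r: int) -> int:
    """Layer applications (a rough proxy for inference compute) of A^r B."""
    return split * r + (n_layers - split)


def layer_calls_repeat_all(n_layers: int, r: int) -> int:
    """Layer applications when the whole stack is repeated r times."""
    return n_layers * r


def sample_depth(r_max: int, p_drop: float, rng: random.Random) -> int:
    """Hypothetical stochastic recursion for training: with probability
    p_drop use r = 1 (no recursion), otherwise use r = r_max."""
    return 1 if rng.random() < p_drop else r_max


# Toy usage: two scalar "layers"; A = add_one, B = double.
add_one: Layer = lambda x: [v + 1.0 for v in x]
double: Layer = lambda x: [2.0 * v for v in x]
layers = [add_one, double]

print(recursive_forward(layers, [0.0], split=1, r=3))  # [6.0]
print(recursive_forward(layers, [0.0], split=1, r=1) == run(layers, [0.0]))  # True
```

Every pass through A reuses the same weights, so recursion adds compute but no parameters apart from any adapters. With r = 1 the function reduces to the plain model. That is the inference setting in which the abstract's no-regret claim is stated.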

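To make "asymptotic performance limit" and "scaling exponent" concrete, here is one common way to fit a data scaling law: a saturating power law in the amount of training data $x$. This form is assumed here for illustration only.

```latex
\varepsilon(x) = \beta \, x^{-c} + \varepsilon_\infty, \qquad \beta > 0,\; c > 0
```

Under this form, $\varepsilon_\infty$ is the loss approached as $x \to \infty$ (the asymptotic limit) and $c$ is the scaling exponent. Improving both means a lower $\varepsilon_\infty$ and a larger $c$. The loss then falls faster as data grows and plateaus at a lower value.
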