Kinetics: Rethinking Test-Time Scaling Laws

Abstract

We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when spent on models above a threshold than on smaller ones. A key reason is that in test-time scaling (TTS), attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes in problem-solving accuracy on AIME, including evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
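
To make the claim about attention cost concrete, here is a rough back-of-the-envelope sketch of memory traffic in one decoding step. It is not the paper's cost model. It assumes a dense transformer with 16-bit weights and KV cache, that weights are read once per step and shared across all parallel samples, and that each sample reads its own full KV cache. The function names and the example configuration (0.6B parameters, 28 layers, 8 KV heads, head dimension 128, 16,384 tokens of context, 32 samples) are hypothetical illustrative values, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes of KV cache one sample must read per decoding step.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def param_bytes(num_params, bytes_per_elem=2):
    """Bytes of model weights read per decoding step (shared by the whole batch)."""
    return num_params * bytes_per_elem


def decode_step_traffic(num_params, num_layers, num_kv_heads, head_dim,
                        context_len, num_samples):
    """Return (weight_bytes, total_kv_bytes) for one decoding step.

    Assumption: every parallel sample holds its own full-length KV cache,
    so KV traffic grows with both context length and sample count.
    """
    weights = param_bytes(num_params)
    kv_total = num_samples * kv_cache_bytes(num_layers, num_kv_heads,
                                            head_dim, context_len)
    return weights, kv_total


# Hypothetical small-model configuration (illustrative values only).
weights, kv_total = decode_step_traffic(
    num_params=600_000_000,
    num_layers=28,
    num_kv_heads=8,
    head_dim=128,
    context_len=16_384,
    num_samples=32,
)
kv_to_weight_ratio = kv_total / weights
```

Under these assumptions the KV cache traffic exceeds the weight traffic by a wide margin, because weight traffic is fixed while KV traffic scales with generation length and the number of samples. This is the kind of cost that a parameter-count-only or FLOPs-only analysis would miss.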
