Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Abstract

Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from the Chain-of-Thoughts (CoTs) used to solve mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning with a curriculum achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
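The gap between exponential and polynomial cost can be made concrete with a toy calculation. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: the reasoning tree has branching factor b and depth d, exactly one root-to-leaf path earns the outcome reward, the base model picks each branch uniformly at random, and a stage counts as "learned" as soon as one rewarded rollout is observed. The function names are hypothetical.

```python
import random


def direct_expected_rollouts(b: int, d: int) -> int:
    """Expected rollouts until a uniform policy first hits the single
    rewarded leaf of a depth-d, branching-b tree (success prob. b**-d)."""
    return b ** d


def curriculum_expected_rollouts(b: int, d: int) -> int:
    """Expected rollouts for a depth-increasing curriculum in which each
    stage only has to discover one new step (success prob. 1/b per stage)."""
    return d * b


def simulate_curriculum(b: int, d: int, rng: random.Random) -> int:
    """Simulate the idealized curriculum: the learned prefix is replayed
    exactly, and only the newest step is sampled uniformly (assumption)."""
    target = [rng.randrange(b) for _ in range(d)]  # the one rewarded path
    rollouts = 0
    for depth in range(d):
        while True:
            rollouts += 1
            if rng.randrange(b) == target[depth]:  # outcome reward observed
                break
    return rollouts


if __name__ == "__main__":
    b, d = 4, 10
    rng = random.Random(0)
    trials = [simulate_curriculum(b, d, rng) for _ in range(2000)]
    print("direct (expected):    ", direct_expected_rollouts(b, d))
    print("curriculum (expected):", curriculum_expected_rollouts(b, d))
    print("curriculum (simulated mean):", sum(trials) / len(trials))
```

In this toy setting the direct cost grows as b^d while the curriculum cost grows as d·b; the paper's actual guarantees rest on its own conditions linking consecutive stages, which this sketch does not model.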
