Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

Abstract

The ability to reason lies at the core of artificial intelligence (AI), andchallenging problems usually call for deeper and longer reasoning to tackle. Acrucial question about AI reasoning is whether models can extrapolate learnedreasoning patterns to solve harder tasks with longer chain-of-thought (CoT). Inthis work, we present a theoretical analysis of transformers learning onsynthetic state-tracking tasks with gradient descent. We mathematically provehow the algebraic structure of state-tracking problems governs the degree ofextrapolation of the learned CoT. Specifically, our theory characterizes thelength generalization of transformers through the mechanism of attentionconcentration, linking the retrieval robustness of the attention layer to thestate-tracking task structure of long-context reasoning. Moreover, fortransformers with limited reasoning length, we prove that a recursiveself-training scheme can progressively extend the range of solvable problemlengths. To our knowledge, we provide the first optimization guarantee thatconstant-depth transformers provably learn $\mathsf{NC}^1$-complete problemswith CoT, significantly going beyond prior art confined in $\mathsf{TC}^0$,unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails.Finally, we present a broad set of experiments supporting our theoreticalresults, confirming the length generalization behaviors and the mechanism ofattention concentration.

Quick Read (beta)

loading the full paper ...