The Expressive Power of Transformers with Chain of Thought

Abstract

Recent theoretical work has identified surprisingly simple reasoningproblems, such as checking if two nodes in a graph are connected or simulatingfinite-state machines, that are provably unsolvable by standard transformersthat answer immediately after reading their input. However, in practice,transformers' reasoning can be improved by allowing them to use a "chain ofthought" or "scratchpad", i.e., generate and condition on a sequence ofintermediate tokens before answering. Motivated by this, we ask: Does suchintermediate generation fundamentally extend the computational power of adecoder-only transformer? We show that the answer is yes, but the amount ofincrease depends crucially on the amount of intermediate generation. Forinstance, we find that transformer decoders with a logarithmic number ofdecoding steps (w.r.t. the input length) push the limits of standardtransformers only slightly, while a linear number of decoding steps adds aclear new ability (under standard complexity conjectures): recognizing allregular languages. Our results also imply that linear steps keep transformerdecoders within context-sensitive languages, and polynomial steps make themrecognize exactly the class of polynomial-time solvable problems -- the firstexact characterization of a type of transformers in terms of standardcomplexity classes. Together, our results provide a nuanced framework forunderstanding how the length of a transformer's chain of thought or scratchpadimpacts its reasoning power.

Quick Read (beta)

loading the full paper ...