Abstract
Transformers have demonstrated exceptional in-context learning capabilities, yet the theoretical understanding of the underlying mechanisms remains limited. A recent work (Elhage et al., 2021) identified a "rich" in-context mechanism known as the induction head, contrasting with "lazy" $n$-gram models that overlook long-range dependencies. In this work, we provide both approximation and optimization analyses of how transformers implement induction heads. In the approximation analysis, we formalize both standard and generalized induction head mechanisms, and examine how transformers can efficiently implement them, with an emphasis on the distinct role of each transformer submodule. For the optimization analysis, we study the training dynamics on a synthetic mixed target, composed of a 4-gram and an in-context 2-gram component. This setting enables us to precisely characterize the entire training process and uncover an {\em abrupt transition} from lazy (4-gram) to rich (induction head) mechanisms as training progresses.