Abstract
Large-scale neural language models exhibit a remarkable capacity for in-context learning (ICL): they can infer novel functions from datasets provided as input. Most of our current understanding of when and how ICL arises comes from LMs trained on extremely simple learning problems like linear regression and associative recall. There remains a significant gap between these model problems and the "real" ICL exhibited by LMs trained on large text corpora, which involves not just retrieval and function approximation but free-form generation of language and other structured outputs. In this paper, we study ICL through the lens of a new family of model problems we term in-context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in-context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models (including several RNNs, Transformers, and state-space model variants) on regular ICLL tasks, aiming to answer three questions: (1) Which model classes are empirically capable of ICLL? (2) What algorithmic solutions do successful models implement to perform ICLL? (3) What architectural changes can improve ICLL in less performant models? We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that their ability to do so relies on specialized "n-gram heads" (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on ICLL, but natural language modeling -- improving the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.
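
To make the ICLL setup concrete, the following is a minimal Python sketch and not the procedure used in the paper. It samples a small random automaton, draws strings from it to form an in-context prompt, and computes a count-based in-context n-gram next-token distribution, which is one simple reading of what an "n-gram head" estimates. The function names (`random_dfa`, `sample_string`, `accepts`, `ngram_next_token`), the `edge_prob` parameter, the fixed start state 0, and the choice to treat every state as accepting are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_dfa(num_states, alphabet, rng, edge_prob=0.5):
    """Sample a partial DFA as {state: {symbol: next_state}}.

    Assumption: each (state, symbol) edge is kept with probability
    edge_prob, and every state keeps at least one outgoing edge.
    """
    dfa = {}
    for state in range(num_states):
        symbols = [a for a in alphabet if rng.random() < edge_prob]
        if not symbols:
            symbols = [rng.choice(alphabet)]
        dfa[state] = {a: rng.randrange(num_states) for a in symbols}
    return dfa


def sample_string(dfa, rng, length):
    """Random walk of fixed length from state 0, emitting edge symbols."""
    state, out = 0, []
    for _ in range(length):
        symbol = rng.choice(sorted(dfa[state]))
        out.append(symbol)
        state = dfa[state][symbol]
    return "".join(out)


def accepts(dfa, string):
    """Assumption: all states accept, so a string is in the language
    exactly when every transition it needs exists."""
    state = 0
    for symbol in string:
        if symbol not in dfa[state]:
            return False
        state = dfa[state][symbol]
    return True


def ngram_next_token(context, n):
    """Count-based in-context n-gram estimate.

    Finds earlier positions in the context whose n-1 tokens match the
    final n-1 tokens, and normalizes counts of the tokens that followed.
    """
    key = tuple(context[-(n - 1):]) if n > 1 else ()
    counts = Counter()
    for i in range(len(context) - (n - 1)):
        if tuple(context[i:i + n - 1]) == key:
            counts[context[i + n - 1]] += 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


rng = random.Random(0)
dfa = random_dfa(num_states=4, alphabet="abc", rng=rng)
examples = [sample_string(dfa, rng, length=6) for _ in range(5)]
prompt = " ".join(examples)  # in-context examples shown to the model
print(prompt)
print(ngram_next_token(list("abcabcab"), n=3))  # {'c': 1.0}
```

In this sketch, a model would be scored on whether its continuations of `prompt` pass `accepts` for the hidden automaton, while `ngram_next_token` plays the role of a non-neural baseline that conditions only on the prompt.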