### Abstract

Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are $\epsilon$-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with $\mathcal{O}(\sqrt{\epsilon})$ error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.