### Abstract

Autoregressive language models pretrained on large corpora have been successful at solving downstream tasks, even with zero-shot usage. However, there is little theoretical justification for their success. This paper considers the following questions: (1) Why should learning the distribution of natural language help with downstream classification tasks? (2) Why do features learned using language modeling help solve downstream tasks with linear classifiers? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as next word prediction tasks, thus making language modeling a meaningful pretraining task. For (2), we analyze properties of the cross-entropy objective to show that $\epsilon$-optimal language models in cross-entropy (log-perplexity) learn features that are $\mathcal{O}(\sqrt{\epsilon})$-good on natural linear classification tasks, thus demonstrating mathematically that doing well on language modeling can be beneficial for downstream tasks. We perform experiments to verify assumptions and validate theoretical results. Our theoretical insights motivate a simple alternative to the cross-entropy objective that performs well on some linear classification tasks.