Abstract
Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining corpora may improve performance on other, unrelated tasks, yet to date no work has been able to establish a causal connection by controlling the balance between language and code data. Here we do just this. We pretrain language models on datasets which interleave natural language and code in two different settings: competitive, in which the total volume of data seen during pretraining is held constant; and additive, in which the volume of language data is held constant. We study how the pretraining mixture affects performance on (a) a diverse collection of tasks included in the BigBench benchmark, and (b) compositionality, measured by generalization accuracy on semantic parsing and syntactic transformations. We find that pretraining on higher proportions of code improves performance on compositional tasks involving structured output (like semantic parsing) and on mathematics. Conversely, increasing the code mixture can harm performance on other tasks, including tasks that require sensitivity to linguistic structure such as syntax or morphology, and tasks measuring real-world knowledge.
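
The two mixture settings differ only in which quantity stays fixed as the proportion of code grows: the total token budget, or the language token budget. A minimal sketch of that arithmetic follows. The function names, the token counts, and the definition of the code fraction as code tokens divided by total tokens are illustrative assumptions, not details taken from the paper.

```python
def fixed_total_mixture(total_tokens: int, code_fraction: float) -> tuple[int, int]:
    """Total budget held constant: code tokens displace language tokens.

    Hypothetical helper. Returns (language_tokens, code_tokens).
    """
    if not 0.0 <= code_fraction <= 1.0:
        raise ValueError("code_fraction must be in [0, 1]")
    code_tokens = round(total_tokens * code_fraction)
    return total_tokens - code_tokens, code_tokens


def fixed_language_mixture(language_tokens: int, code_fraction: float) -> tuple[int, int]:
    """Language budget held constant: code tokens are added on top.

    Hypothetical helper. Assumes code_fraction = code / (language + code),
    so code = language * f / (1 - f). Returns (language_tokens, code_tokens).
    """
    if not 0.0 <= code_fraction < 1.0:
        raise ValueError("code_fraction must be in [0, 1)")
    code_tokens = round(language_tokens * code_fraction / (1.0 - code_fraction))
    return language_tokens, code_tokens


if __name__ == "__main__":
    # Illustrative budgets only; these are not the paper's dataset sizes.
    for f in (0.0, 0.2, 0.5):
        print(f, fixed_total_mixture(1000, f), fixed_language_mixture(1000, f))
```

With a fixed total budget, every added code token removes a language token, so any drop in performance on language tasks may reflect less language data as well as more code. With a fixed language budget the language data is unchanged, but the total amount of training data grows with the code fraction.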