Abstract
Can LLMs pick up language structure from examples? Evidence in prior work seems to indicate yes, as pretrained models repeatedly demonstrate the ability to adapt to new language structures and vocabularies. However, this line of research typically considers languages that are present within common pretraining datasets, or that otherwise share notable similarities with these seen languages. In contrast, in this work we attempt to measure models' language understanding capacity while circumventing the risk of dataset recall. We parameterize large families of language tasks recognized by deterministic finite automata (DFAs), and can thus sample novel language reasoning problems to fairly evaluate LLMs regardless of training data. We find that, even in the strikingly simple setting of 3-state DFAs, LLMs underperform unparameterized n-gram models on both language recognition and synthesis tasks. These results suggest that LLMs struggle to match the ability of basic language models in recognizing and reasoning over languages that are sufficiently distinct from the ones they see at training time, underscoring the distinction between learning individual languages and possessing a general theory of language.