Abstract
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences can be reduced by increasing model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best-performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans in only one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality in the same way that humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.