Assessing Language Comprehension in Large Language Models Using Construction Grammar

Abstract

Large Language Models (LLMs), despite their significant capabilities, are known to fail in surprising and unpredictable ways. Evaluating their true 'understanding' of language is particularly challenging due to the extensive web-scale data they are trained on. Therefore, we construct an evaluation to systematically assess natural language understanding (NLU) in LLMs by leveraging Construction Grammar (CxG), which provides insights into the meaning captured by linguistic elements known as constructions (Cxns). CxG is well-suited for this purpose because it provides a theoretical basis to construct targeted evaluation sets. These datasets are carefully constructed to include examples which are unlikely to appear in pre-training data, yet intuitive and easy for humans to understand, enabling a more targeted and reliable assessment. Our experiments focus on downstream natural language inference and reasoning tasks by comparing LLMs' understanding of the underlying meanings communicated through 8 unique Cxns with that of humans. The results show that while LLMs demonstrate some knowledge of constructional information, even the latest models, including GPT-o1, struggle with abstract meanings conveyed by these Cxns, as demonstrated in cases where test sentences are dissimilar to their pre-training data. We argue that such cases provide a more accurate test of true language understanding, highlighting key limitations in LLMs' semantic capabilities. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.
