Abstract
Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questionable, raising doubts about whether these models learn language similarly to humans. While humans exhibit compositional generalization and linguistic creativity in language use, the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization, particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.