Abstract
We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.