Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages

Abstract

While there has been significant progress towards developing NLU resources for Indic languages, syntactic evaluation has been relatively less explored. Unlike English, Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology. In this paper, we introduce Vyākarana: a benchmark of Colorless Green sentences in Indic languages for syntactic evaluation of multilingual language models. The benchmark comprises four syntax-related tasks: PoS Tagging, Syntax Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. We use the datasets from the evaluation tasks to probe five multilingual language models of varying architectures for syntax in Indic languages. Due to its prevalence, we also include a code-switching setting in our experiments. Our results show that the token-level and sentence-level representations from the Indic language models (IndicBERT and MuRIL) do not capture the syntax in Indic languages as efficiently as the other highly multilingual language models. Further, our layer-wise probing experiments reveal that while mBERT, DistilmBERT, and XLM-R localize the syntax in middle layers, the Indic language models do not show such syntactic localization.
