Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages

  • 2021-10-02 07:01:45
  • Rajaswa Patil, Jasleen Dhillon, Siddhant Mahurkar, Saumitra Kulkarni, Manav Malhotra, Veeky Baths


While there has been significant progress towards developing NLU resources for Indic languages, syntactic evaluation has been relatively less explored. Unlike English, Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology. In this paper, we introduce Vyākarana: a benchmark of Colorless Green sentences in Indic languages for syntactic evaluation of multilingual language models. The benchmark comprises four syntax-related tasks: PoS Tagging, Syntax Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. We use the datasets from the evaluation tasks to probe five multilingual language models of varying architectures for syntax in Indic languages. Due to its prevalence, we also include a code-switching setting in our experiments. Our results show that the token-level and sentence-level representations from the Indic language models (IndicBERT and MuRIL) do not capture the syntax in Indic languages as efficiently as the other highly multilingual language models. Further, our layer-wise probing experiments reveal that while mBERT, DistilmBERT, and XLM-R localize the syntax in middle layers, the Indic language models do not show such syntactic localization.
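
To make the Syntax Tree-depth Prediction task more concrete, the following is a minimal sketch of how a depth label could be derived from a dependency parse. It is an illustration only, not the benchmark's actual labelling code: the function name `tree_depth`, the 1-based head-index encoding (0 marks the root, as in CoNLL-U style annotations), and the choice to count depth in nodes along the longest root-to-leaf path are all assumptions made here.

```python
from typing import List


def tree_depth(heads: List[int]) -> int:
    """Return the depth of a dependency tree (hypothetical helper).

    heads[i] is the 1-based index of the head of token i + 1, and 0 marks
    the root. Depth is counted in nodes on the longest root-to-leaf path,
    so a one-token sentence has depth 1 and an empty sentence has depth 0.
    """
    # depths[i] caches the depth of token i + 1; 0 means "not computed yet".
    depths = [0] * len(heads)

    def depth_of(token: int) -> int:
        # Climb towards the root until we reach it or hit a cached token.
        path = []
        current = token
        while current != 0 and depths[current - 1] == 0:
            if current in path:
                raise ValueError("head indices contain a cycle")
            path.append(current)
            current = heads[current - 1]
        # Depth of the node we stopped at (0 if we stepped past the root).
        base = 0 if current == 0 else depths[current - 1]
        # Walk back down the path, assigning depths from the top.
        for node in reversed(path):
            base += 1
            depths[node - 1] = base
        return depths[token - 1]

    return max((depth_of(t) for t in range(1, len(heads) + 1)), default=0)


# Assumed English toy parse of "The dog chased the cat":
# The -> dog, dog -> chased, chased -> ROOT, the -> cat, cat -> chased
example_heads = [2, 3, 0, 5, 3]
example_depth = tree_depth(example_heads)
```

Under these assumptions, a probing classifier for this task would take a sentence-level representation as input and predict the integer returned by `tree_depth` for that sentence.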


