IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

Abstract

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://tinyurl.com/IndicSentEval].
