IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages

Abstract

The rapid progress in question-answering (QA) systems has predominantly benefited high-resource languages, leaving Indic languages largely underrepresented despite their vast native speaker base. In this paper, we present IndicSQuAD, a comprehensive multilingual extractive QA dataset covering nine major Indic languages, systematically derived from the SQuAD dataset. Building on previous work with MahaSQuAD for Marathi, our approach adapts and extends translation techniques to maintain high linguistic fidelity and accurate answer-span alignment across diverse languages. IndicSQuAD comprises extensive training, validation, and test sets for each language, providing a robust foundation for model development. We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT. The results indicate some challenges inherent in low-resource settings. Moreover, our experiments suggest potential directions for future work, including expanding to additional languages, developing domain-specific datasets, and incorporating multimodal data. The dataset and models are publicly shared at https://github.com/l3cube-pune/indic-nlp
