Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

Abstract

We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text which spans all 22 Indic languages. We also train IndicLID, a language identifier for all the above-mentioned languages in both native and romanized script. For native-script text, it has better language coverage than existing LIDs and is competitive with or better than other LIDs. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized text LID are the lack of training data and low LID performance when languages are similar. We provide simple and effective solutions to these problems. In general, there has been limited work on romanized text in any language, and our findings are relevant to other languages that need romanized language identification. Our models are publicly available at https://github.com/AI4Bharat/IndicLID under open-source licenses. Our training and test sets are also publicly available at https://huggingface.co/datasets/ai4bharat/Bhasha-Abhijnaanam under open-source licenses.
