ILID: Native Script Language Identification for Indian Languages

Abstract

Language identification is a fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed text. This becomes even harder for the diverse Indian languages, which exhibit lexical and phonetic similarities yet remain distinct. Many Indian languages also share the same script, making the task even more challenging. Taking these challenges into account, we develop and release a dataset of 250K sentences covering 23 languages, namely English and all 22 official Indian languages, each labeled with its language identifier; the data for most languages are newly created. We also develop and release baseline models using state-of-the-art machine learning approaches and fine-tuned pre-trained transformer models. Our models outperform the state-of-the-art pre-trained transformer models on the language identification task. The dataset and code are available at https://yashingle-ai.github.io/ILID/ and on Hugging Face.
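To illustrate why a shared script makes identification harder, the following minimal sketch (not part of the ILID release) detects the dominant Unicode script of a sentence using only the Python standard library. The function name `dominant_script` and the two example sentences are assumptions chosen for illustration. Hindi and Marathi are both written in Devanagari, so a script check alone cannot separate them and a model has to rely on lexical cues instead.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in `text`.

    Hypothetical helper for illustration: the script is approximated by the
    first word of each character's Unicode name (e.g. "DEVANAGARI LETTER KA").
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skips digits, spaces, punctuation, vowel signs
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


hindi = "मेरा नाम राम है"      # Hindi: "My name is Ram"
marathi = "माझे नाव राम आहे"   # Marathi: "My name is Ram"

# Both sentences resolve to the same script, so script detection
# alone cannot tell the two languages apart.
print(dominant_script(hindi), dominant_script(marathi))
```

A script check of this kind can still be useful as a cheap first pass that narrows the candidate languages before a classifier is applied.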
