Abstract
Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIt, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.