LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

Abstract

Knowing the language of an input text/audio is a necessary first step for using almost every natural language processing (NLP) tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, most of the world's 7000 languages are not supported by current systems. This lack of representation affects large-scale data mining efforts and further exacerbates data shortage for low-resource languages. We take a step towards tackling the data bottleneck by compiling a corpus of over 50K parallel children's stories in 350+ languages and dialects, and the computation bottleneck by building lightweight hierarchical models for language identification. Our data can serve as benchmark data for language identification of short texts and for understudied translation directions such as those between Indian or African languages. Our proposed method, Hierarchical LIMIT, uses limited computation to expand coverage into excluded languages while maintaining prediction quality.
