Abstract
The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), there is no conventional spelling of words in the Latin script, hence there will be high spelling variability in written text. Such romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.