Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Abstract

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Keywords: romanization, transliteration, South Asian languages
