Language Identification for Austronesian Languages

Abstract

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings reaches a significantly higher performance than alternate methods. We then systematically increase the number of non-Austronesian languages in the model up to a total of 800 languages to evaluate whether an increased language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that there is only a minimal impact on accuracy caused by increasing the inventory of non-Austronesian languages. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
