Short Text Language Identification for Under Resourced Languages

Abstract

The paper presents a hierarchical naive Bayesian and lexicon-based classifier for short text language identification (LID) useful for under-resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages, some of which are similar languages. The algorithm is compared to recent approaches using test sets from previous work on South African languages as well as the Discriminating between Similar Languages (DSL) shared tasks' datasets. Remaining research opportunities and pressing concerns in evaluating and comparing LID approaches are also discussed.
