A Fast, Compact, Accurate Model for Language Identification of Codemixed Text

  • 2018-10-09 17:21:41
  • Yuan Zhang, Jason Riesa, Daniel Gillick, Anton Bakalov, Jason Baldridge, David Weiss

Abstract

We address fine-grained multilingual language identification: providing a language code for every token in a sentence, including codemixed text containing multiple languages. Such text is prevalent online, in documents, social media, and message boards. We show that a feed-forward network with a simple globally constrained decoder can accurately and rapidly label both codemixed and monolingual text in 100 languages and 100 language pairs. This model outperforms previously published multilingual approaches in terms of both accuracy and speed, yielding an 800x speed-up and a 19.5% averaged absolute gain on three codemixed datasets. It furthermore outperforms several benchmark systems on monolingual language identification.
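The abstract does not spell out how the globally constrained decoder works. As a rough illustration only, the sketch below assumes one plausible reading: each token receives per-language log-probabilities from some upstream classifier, and the decoder restricts the whole sentence to either a single language or one language pair from an allowed list, picking whichever restriction gives the highest total score. The function name `constrained_decode`, the score values, and the allowed pair list are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def constrained_decode(
    token_scores: Sequence[Dict[str, float]],
    allowed_pairs: Sequence[Tuple[str, str]],
) -> List[str]:
    """Label every token while limiting the sentence to one language
    or one allowed language pair (an assumed form of global constraint)."""
    # Candidate label sets: every single language seen, plus each allowed pair.
    languages = sorted({lang for scores in token_scores for lang in scores})
    candidates = [(lang,) for lang in languages]
    candidates += [tuple(sorted(pair)) for pair in allowed_pairs]

    best_total = -math.inf
    best_labels: List[str] = []
    for candidate in candidates:
        # Within a candidate set, each token independently takes its best language.
        labels = [
            max(candidate, key=lambda lang: scores.get(lang, -math.inf))
            for scores in token_scores
        ]
        total = sum(
            scores.get(label, -math.inf)
            for scores, label in zip(token_scores, labels)
        )
        if total > best_total:
            best_total, best_labels = total, labels
    return best_labels


# Hypothetical log-probabilities for a four-token English/Spanish sentence.
example_scores = [
    {"en": -0.1, "es": -2.5, "fr": -3.0},
    {"en": -0.4, "es": -3.0, "fr": -0.3},  # per-token argmax alone would say "fr"
    {"en": -1.5, "es": -0.4, "fr": -2.5},
    {"en": -3.0, "es": -0.2, "fr": -1.0},
]
labels = constrained_decode(example_scores, allowed_pairs=[("en", "es")])
print(labels)  # ['en', 'en', 'es', 'es']
```

In this toy input, labeling each token independently would put a third language on the second token; the sentence-level constraint overrides that choice because the English/Spanish pair scores best overall.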

 
