Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

Abstract

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.
