TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

Abstract

Pretrained contextualized text representation models learn an effective representation of a natural language to make it machine understandable. After the breakthrough of the attention mechanism and the introduction of the Transformer, a new generation of pretrained models has been proposed, achieving good performance. Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for language understanding. Despite their success, most of the available models have been trained on Indo-European languages; similar research for under-represented languages and dialects remains sparse. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for under-represented languages, with a specific focus on the Tunisian dialect. We evaluate our language model on three tasks: sentiment analysis, dialect identification, and reading comprehension question answering. We show that noisy web-crawled data is better suited to such a non-standardized language than structured data (Wikipedia, articles, etc.). Moreover, results indicate that a relatively small web-crawled dataset leads to performance as good as that obtained using larger datasets. Finally, our best-performing TunBERT model matches or improves on the state of the art in all three downstream tasks. We release the TunBERT pretrained model and the datasets used for fine-tuning.
