AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text

Abstract

Language models built from various sources are the foundation of today's NLP progress. However, for many low-resource languages, the diversity of domains is often limited and skewed toward the religious domain, which hurts model performance on distant and rapidly evolving domains such as social media. Domain-adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) are popular techniques for reducing this bias through continual pre-training of BERT-based models, but they have not been explored for African multilingual encoders. In this paper, we explore DAPT and TAPT continual pre-training approaches for the social media domain in African languages. We introduce AfriSocial, a large-scale social media and news domain corpus for continual pre-training on several African languages. Leveraging AfriSocial, we show that DAPT consistently improves performance on three subjective tasks (sentiment analysis, multi-label emotion classification, and hate speech classification) covering 19 languages, with improvements ranging from 1% to 30% F1 score. Similarly, applying TAPT with data from one task improves performance on other related tasks. For example, training with unlabeled sentiment data (source) for a fine-grained emotion classification task (target) improves the baseline results by an F1 score ranging from 0.55% to 15.11%. Combining the two methods (i.e., DAPT + TAPT) further improves overall performance.
