Abstract
Fine-tuning pre-trained language models for downstream tasks requires a sufficient amount of annotated data. Unfortunately, obtaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce dialects using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned only on Modern Standard Arabic (MSA) to predict named entities (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as much as \texttildelow 10\% F$_1$ (NER) and 2\% accuracy (POS tagging). We achieve even better performance in few-shot scenarios with limited labeled data. We conduct an ablation experiment and show that the observed performance boost results directly from the unlabeled DA examples used for self-training, which opens up opportunities for developing DA models that exploit only MSA resources. Our approach can also be extended to other languages and tasks.
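
To make the idea of self-training concrete, the following is a minimal sketch of a generic self-training loop with confidence thresholding. It is not the procedure used in this work: a toy nearest-centroid classifier stands in for the fine-tuned pre-trained language model, single feature vectors stand in for token sequences, and all names (`train_centroids`, `predict_with_confidence`, `self_train`) as well as the `threshold` and `max_rounds` parameters are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Model = Dict[str, Vector]


def train_centroids(xs: Sequence[Vector], ys: Sequence[str]) -> Model:
    """Toy stand-in for fine-tuning: store one mean vector per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, y in zip(xs, ys):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: tuple(s / counts[y] for s in sums[y]) for y in sums}


def predict_with_confidence(model: Model, x: Vector) -> Tuple[str, float]:
    """Return (label, confidence) via a softmax over negative squared distances."""
    scores = {y: -sum((a - b) ** 2 for a, b in zip(x, c)) for y, c in model.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    label = max(exps, key=exps.get)
    return label, exps[label] / total


def self_train(
    labeled_x: Sequence[Vector],
    labeled_y: Sequence[str],
    unlabeled_x: Sequence[Vector],
    threshold: float = 0.9,  # assumed confidence cut-off, not taken from the source
    max_rounds: int = 5,     # assumed iteration cap, not taken from the source
) -> Tuple[Model, int]:
    """Repeatedly add confident predictions on unlabeled data to the training set."""
    train_x, train_y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    model = train_centroids(train_x, train_y)
    for _ in range(max_rounds):
        remaining = []
        added = 0
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                # Treat the confident prediction as a pseudo-label.
                train_x.append(x)
                train_y.append(label)
                added += 1
            else:
                remaining.append(x)
        if added == 0:
            break  # nothing confident enough is left; stop early
        pool = remaining
        model = train_centroids(train_x, train_y)  # retrain on gold + pseudo-labels
    return model, len(train_x) - len(labeled_x)


# Labeled "source" examples (playing the role of MSA data) and
# unlabeled "target" examples (playing the role of DA data).
source_x = [(0.0,), (4.0,)]
source_y = ["O", "PER"]
target_x = [(0.5,), (3.5,), (1.9,), (2.2,)]

model, n_pseudo_labeled = self_train(source_x, source_y, target_x)
print(model, n_pseudo_labeled)
```

In this toy run only the target points that lie clearly on one side of the decision boundary are pseudo-labeled; the ambiguous ones stay in the unlabeled pool, which is the filtering behavior that confidence thresholding is meant to provide.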