MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

Abstract

In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems more effective for POS tagging in unseen languages.
