MasakhaNEWS: News Topic Classification for African languages

Abstract

African languages are severely under-represented in NLP research due to lackof datasets covering several NLP tasks. While there are individual languagespecific datasets that are being expanded to different tasks, only a handful ofNLP tasks (e.g. named entity recognition and machine translation) havestandardized benchmark datasets covering several geographical andtypologically-diverse African languages. In this paper, we develop MasakhaNEWS-- a new benchmark dataset for news topic classification covering 16 languageswidely spoken in Africa. We provide an evaluation of baseline models bytraining classical machine learning models and fine-tuning several languagemodels. Furthermore, we explore several alternatives to full fine-tuning oflanguage models that are better suited for zero-shot and few-shot learning suchas cross-lingual parameter-efficient fine-tuning (like MAD-X), patternexploiting training (PET), prompting language models (like ChatGPT), andprompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API).Our evaluation in zero-shot setting shows the potential of prompting ChatGPTfor news topic classification in low-resource African languages, achieving anaverage performance of 70 F1 points without leveraging additional supervisionlike MAD-X. In few-shot setting, we show that with as little as 10 examples perlabel, we achieved more than 90\% (i.e. 86.0 F1 points) of the performance offull supervised training (92.6 F1 points) leveraging the PET approach.

Quick Read (beta)

loading the full paper ...