NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Abstract

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yoruba) consisting of around 30,000 annotated tweets per language (except for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labelling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
