NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

  • 2022-01-20 16:28:06
  • Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Anuoluwapo Aremu, Saheed Abdul, Pavel Brazdil

Abstract

Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yoruba), consisting of around 30,000 annotated tweets per language (except for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labelling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.

 
