A multilingual dataset for offensive language and hate speech detection for Hausa, Yoruba and Igbo languages

Abstract

The proliferation of online offensive language necessitates the development of effective detection mechanisms, especially in multilingual contexts. This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%. To further support research in offensive language detection, we plan to make the dataset and our models publicly available.
