The first large scale collection of diverse Hausa language datasets

Abstract

Hausa belongs to the Afroasiatic phylum and has more first-language speakers than any other sub-Saharan African language. With the majority of its speakers residing in northern Nigeria and the southern Republic of Niger, it is estimated that over 100 million people speak the language, making it one of the most widely spoken Chadic languages. While Hausa is considered a well-studied and well-documented language among the sub-Saharan African languages, it is viewed as a low-resource language from the perspective of natural language processing (NLP) because few resources are available for NLP-related tasks. This is common to most languages in Africa; thus, it is crucial to enrich such languages with resources that will support and speed up various downstream tasks to meet the demands of modern society. While useful datasets exist, notably from news sites and religious texts, more diversity is needed in the corpus. We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language, drawn from reputable websites and online social media networks, respectively. The collection is larger and more diverse than existing corpora, providing the first and largest set of Hausa social media posts to capture the peculiarities of the language. The collection also includes a parallel dataset, which can be used for tasks such as machine translation, with applications in areas such as the detection of spurious or inciting online content. We describe the curation process (collection, preprocessing, and how to obtain the data) and proffer some research problems that could be addressed using the data.
