KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language

Abstract

This research developed a Kencorpus Swahili Question Answering Dataset (KenSwQuAD) from raw data of the Swahili language, a low resource language predominantly spoken in Eastern Africa that also has speakers in other parts of the world. Question Answering datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. However, before such machine learning systems can perform these tasks, they need training data such as the gold standard Question Answering (QA) set that is developed in this research. The research engaged annotators to formulate question answer pairs from Swahili texts that had been collected by the Kencorpus project, a Kenyan languages corpus that collected data from three Kenyan languages. The total Swahili data collection had 2,585 texts, out of which we annotated 1,445 story texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts was subjected to re-evaluation by different annotators, who confirmed that the QA pairs were all correctly annotated. A proof of concept on applying the set to machine learning on the question answering task confirmed that the dataset can be used for such practical tasks. The research therefore developed KenSwQuAD, a question-answer dataset for Swahili that is useful to the natural language processing community, who need training and gold standard sets for their machine learning applications. The research also contributed to the resourcing of the Swahili language, which is important for communication around the globe. Updating this set and providing similar sets for other low resource languages is an important area that is worthy of further research.
