Text Normalization for Low-Resource Languages of Africa

Abstract

Training data for machine learning models can come from many different sources, some of dubious quality. For resource-rich languages like English, so much data is available that we can afford to discard the dubious data. For low-resource languages, where much less data is available, we cannot necessarily afford to discard it, because the remaining training set may be too small to train a model. In this study, we examine the effects of text normalization and data set quality for a set of low-resource languages of Africa: Afrikaans, Amharic, Hausa, Igbo, Malagasy, Somali, Swahili, and Zulu. We describe our text normalizer, which we built in the Pynini framework, a Python library for finite state transducers, and our experiments in training language models for African languages using the Natural Language Toolkit (NLTK), an open-source Python library for NLP.
