Low-Resource Language Modelling of South African Languages

Abstract

Language models are the foundation of current neural network-based models for natural language understanding and generation. However, research on the intrinsic performance of language models on African languages has been extremely limited, which is made more challenging by the lack of large or standardised training and evaluation sets that exist for English and other high-resource languages. In this paper, we evaluate the performance of open-vocabulary language models on low-resource South African languages, using byte-pair encoding to handle the rich morphology of these languages. We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs), and Transformers on small-scale datasets. Overall, well-regularized RNNs give the best performance across two isiZulu and one Sepedi datasets. Multilingual training further improves performance on these datasets. We hope that this research will open new avenues for research into multilingual and low-resource language modelling for African languages.
