Abstract
Child literacy is a strong predictor of outcomes in the later stages of an individual's life. This points to a need for targeted interventions in vulnerable low- and middle-income populations to help bridge the gap between literacy levels in these regions and high-income ones. In this effort, reading assessments provide an important tool to measure the effectiveness of these programs, and AI can be a reliable and economical tool to support educators with this task. Developing accurate automatic reading assessment systems for child speech in low-resource languages poses significant challenges due to limited data and the unique acoustic properties of children's voices. This study focuses on Xhosa, a language spoken in South Africa, to advance child speech recognition capabilities. We present a novel dataset composed of child speech samples in Xhosa. The dataset is available upon request and contains ten words and letters, which are part of the Early Grade Reading Assessment (EGRA) system. Each recording is labeled with an online and cost-effective approach by multiple markers, and a subsample is validated by an independent EGRA reviewer. This dataset is evaluated with three fine-tuned state-of-the-art end-to-end models: wav2vec 2.0, HuBERT, and Whisper. The results indicate that the performance of these models can be significantly influenced by the amount and balancing of the available training data, a factor that is fundamental for cost-effective collection of large datasets. Furthermore, our experiments indicate that wav2vec 2.0 performance improves when training on multiple classes at a time, even when the number of available samples is constrained.