AfriHuBERT: A self-supervised speech representation model for African languages

Abstract

In this work, we present AfriHuBERT, an extension of mHuBERT-147, a compact, state-of-the-art (SOTA) self-supervised learning (SSL) model originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand coverage to 39 African languages, 23 of them newly added, through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources. We evaluate AfriHuBERT on two key speech tasks, Language Identification (LID) and Automatic Speech Recognition (ASR), using the FLEURS dataset. Our results show an average F1 score improvement of +4% for LID and an average Word Error Rate (WER) reduction of 1.2% for ASR. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization. Additionally, the analysis indicates that FLEURS has data quality limitations that may affect its suitability for evaluating low-resource African languages, suggesting the need for better evaluation benchmarks for these languages.
