Speech Language Models for Under-Represented Languages: Insights from Wolof

  • 2025-09-25 08:40:33
  • Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina

Abstract

We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality unsupervised speech data, and show that continued pretraining HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.

 
