Abstract
We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality unsupervised speech data, and show that continued pretraining of HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.