Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal

Abstract

This work is part of the Kallaama project, whose objective is to produce and disseminate national-language corpora for the development of speech technologies in the field of agriculture. Except for Wolof, which benefits from some language data for natural language processing, the national languages of Senegal are largely ignored by language technology providers. However, such technologies are key to the protection, promotion and teaching of these languages. Kallaama focuses on the three languages most widely spoken by Senegalese people: Wolof, Pulaar and Sereer. These languages are widely spoken by the population, with around 10 million native Senegalese speakers, not to mention those outside the country. However, they remain under-resourced in terms of machine-readable data that can be used for automatic processing and language technologies, all the more so in the agricultural sector. We release a transcribed speech dataset containing 125 hours of recordings, about agriculture, in each of the above-mentioned languages. These resources are specifically designed for Automatic Speech Recognition purposes, including traditional approaches. To build such technologies, we provide textual corpora in Wolof and Pulaar, and a pronunciation lexicon containing 49,132 entries from the Wolof dataset.
