A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

Abstract

Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
