MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

Abstract

The CMU Wilderness Multilingual Speech Dataset is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages has not been exploited to date. Therefore, this article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for syntactically divergent language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs). Lastly, we showcase the usefulness of the final product on a bilingual speech retrieval task.
