Scaling Speech Technology to 1,000+ Languages

Abstract

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
