Adapting Multilingual Speech Representation Model for a New, Underresourced Language through Multilingual Fine-tuning and Continued Pretraining

Abstract

In recent years, neural models learned through self-supervised pretraining on large-scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has the potential to facilitate tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility of adapting an existing multilingual wav2vec 2.0 model for a new language, focusing on actual fieldwork data from a critically endangered tongue: Ainu. Specifically, we (i) examine the feasibility of also leveraging data from similar languages in fine-tuning; and (ii) verify whether the model's performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language and leads to a considerable reduction in error rates. Furthermore, we find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have a positive impact on speech recognition performance when there is very little labeled data in the target language.
