Abstract
In recent years, neural models learned through self-supervised pretraining on large-scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has the potential to facilitate tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility of adapting an existing multilingual wav2vec 2.0 model to a new language, focusing on actual fieldwork data from a critically endangered tongue: Ainu. Specifically, we (i) examine the feasibility of also leveraging data from similar languages in fine-tuning; and (ii) verify whether the model's performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model to a new language and leads to a considerable reduction in error rates. Furthermore, we find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have a positive impact on speech recognition performance when there is very little labeled data in the target language.