Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

Abstract

We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data is available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, which is surprising since the shared language in these tasks is the target language (text), and not the source language (audio). Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.
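The parameter transfer described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names, the `init_params` and `transfer` helpers, and the use of one float per parameter are all hypothetical stand-ins, assuming only that the ASR and ST models share an encoder-decoder architecture with matching parameter names. It shows full transfer (encoder and decoder) and the encoder-only variant of the kind an ablation would compare; fine-tuning on ST data would follow and is not shown.

```python
import random

# Hypothetical parameter names for a shared encoder-decoder architecture.
PARAM_NAMES = [
    "encoder.conv.weight",
    "encoder.lstm.weight",
    "decoder.embed.weight",
    "decoder.lstm.weight",
]


def init_params(names, seed):
    """Stand-in for random initialization: one float per named parameter."""
    rng = random.Random(seed)
    return {name: rng.uniform(-0.1, 0.1) for name in names}


def transfer(asr_params, st_params, prefixes=("encoder.", "decoder.")):
    """Return a copy of st_params in which every parameter whose name starts
    with one of `prefixes` is overwritten by the pre-trained ASR value."""
    out = dict(st_params)
    for name, value in asr_params.items():
        if name in out and name.startswith(tuple(prefixes)):
            out[name] = value
    return out


# Pretend these values came from training on the high-resource ASR task.
asr_params = init_params(PARAM_NAMES, seed=0)
# Freshly initialized ST model (the no-pre-training baseline).
st_random = init_params(PARAM_NAMES, seed=1)

# Full transfer: both encoder and decoder start from the ASR model.
st_full = transfer(asr_params, st_random)
# Ablation-style variant: only the encoder (acoustic model) is transferred.
st_encoder_only = transfer(asr_params, st_random, prefixes=("encoder.",))
```

In the encoder-only variant the decoder keeps its random initialization, which is what lets the contribution of the pre-trained encoder be measured separately from that of the decoder.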
