From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

Abstract

In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., the encoder) of a Conformer-based RNN-Transducer used as a frozen pre-trained backbone. Experiments on a seven-language Multilingual LibriSpeech (MLS) task show that model reprogramming requires only 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of the trainable parameters of a full ASR model to achieve competitive results, with WERs averaged across languages ranging from 11.9% to 8.1%. In addition, we discover different setups that make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extensions with self-supervised losses (e.g., w2v-BERT) in terms of lower WER and better training efficiency.
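The parameter budget quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `trainable_fraction` and the configuration labels are hypothetical, and the inputs are the rounded parameter counts (in millions) from the abstract, not exact model sizes. With these rounded counts the smaller configuration comes out at roughly 4.1% rather than the reported 4.2%, which would be consistent with the true counts having been rounded to the nearest million (an assumption, not something the abstract states).

```python
def trainable_fraction(trainable_params: float, total_params: float) -> float:
    """Return the share of parameters that are updated, as a percentage."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return 100.0 * trainable_params / total_params


# Rounded parameter counts in millions, taken from the abstract.
# The labels are hypothetical names used only for this illustration.
configs = {
    "270M backbone": (11, 270),
    "660M backbone": (45, 660),
}

for name, (trainable, total) in configs.items():
    share = trainable_fraction(trainable, total)
    print(f"{name}: {trainable}M of {total}M parameters trained ({share:.1f}%)")
```

In both cases more than 93% of the backbone stays frozen, which is the sense in which the framework is parameter-efficient.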
