Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

Abstract

In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques: transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR system, we leverage a well-trained English model, an unlabeled text corpus, and an unlabeled audio corpus using transfer learning, TTS augmentation, and SSL, respectively. In the first stage, we use transfer learning from a well-trained English model. This primarily helps in learning the acoustic information from a resource-rich language. This stage achieves around 24% relative Word Error Rate (WER) reduction over the baseline. In stage two, we utilize unlabeled text data via TTS data augmentation to incorporate language information into the model. We also explore freezing the acoustic encoder at this stage. TTS data augmentation helps us further reduce the WER by about 21% relative. Finally, in stage three, we reduce the WER by another 4% relative by using SSL from unlabeled audio data. Overall, our two-pass speech recognition system, with Monotonic Chunkwise Attention (MoChA) in the first pass and full attention in the second pass, achieves a WER reduction of about 42% relative to the baseline.
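
The per-stage figures and the overall figure can be checked against each other with a short calculation. The sketch below assumes that each stage's relative reduction is measured against the WER of the preceding stage (an interpretation, not something the abstract states explicitly); the function name `compound_relative_reduction` is ours and purely illustrative.

```python
def compound_relative_reduction(stage_reductions):
    """Combine successive relative WER reductions into one overall reduction.

    Assumption: each reduction is relative to the WER left by the previous
    stage, so the remaining fractions multiply.
    """
    remaining = 1.0  # fraction of the baseline WER still present
    for reduction in stage_reductions:
        remaining *= 1.0 - reduction
    return 1.0 - remaining


# Stage 1: transfer learning (24%), stage 2: TTS augmentation (21%),
# stage 3: SSL (4%), as quoted in the abstract.
overall = compound_relative_reduction([0.24, 0.21, 0.04])
print(f"{overall:.1%}")  # prints 42.4%
```

Under that assumption the three stages compound to roughly 42% relative to the baseline, which agrees with the overall figure reported for the two-pass system.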
