Self-Training for End-to-End Speech Recognition

Abstract

We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model by leveraging unlabelled data. Key to our approach are a strong baseline acoustic and language model used to generate the pseudo-labels, a robust and stable beam-search decoder, and a novel ensemble approach used to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that self-training with a single model can yield a 21% relative WER improvement on clean data over a baseline trained on 100 hours of labelled data. We also evaluate label filtering approaches to increase pseudo-label quality. With an ensemble of six models in conjunction with label filtering, self-training yields a 26% relative improvement and bridges 55.6% of the gap between the baseline and an oracle model trained with all of the labels.
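
The two figures of merit quoted above, relative WER improvement and the fraction of the baseline-to-oracle gap that is bridged, can be illustrated with a minimal sketch. The function names and the WER values below are hypothetical and are not taken from the paper; the gap formula is the usual reading of "bridging the gap" and is an assumption about how the reported number was computed.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new model."""
    return (baseline_wer - new_wer) / baseline_wer


def gap_bridged(baseline_wer: float, new_wer: float, oracle_wer: float) -> float:
    """Fraction of the baseline-to-oracle WER gap closed by the new model.

    Assumed definition: (baseline - new) / (baseline - oracle).
    """
    return (baseline_wer - new_wer) / (baseline_wer - oracle_wer)


# Hypothetical WERs in percent, chosen only for illustration (not paper results).
baseline, self_trained, oracle = 10.0, 8.0, 6.0

rel = relative_improvement(baseline, self_trained)
gap = gap_bridged(baseline, self_trained, oracle)

print(f"relative WER improvement: {rel:.1%}")
print(f"gap bridged: {gap:.1%}")
```

Under this definition the two quantities share a numerator, so a given relative improvement corresponds to a larger share of the gap whenever the oracle model itself leaves some WER on the table.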
