Factorized Neural Transducer for Efficient Language Model Adaptation

Abstract

In recent years, end-to-end (E2E) automatic speech recognition (ASR) systems have achieved great success due to their simplicity and promising performance. Neural Transducer based models are increasingly popular in streaming E2E ASR systems and have been reported to outperform the traditional hybrid system in some scenarios. However, the joint optimization of the acoustic model, lexicon and language model in the neural Transducer also makes it challenging to use text-only data for language model adaptation. This drawback might limit their application in practice. To address this issue, we propose a novel model, the factorized neural Transducer, which factorizes the blank and vocabulary prediction and adopts a standalone language model for the vocabulary prediction. This factorization is expected to transfer improvements in the standalone language model to the Transducer for speech recognition, which allows various language model adaptation techniques to be applied. We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation, at the cost of a minor degradation in WER on a general test set.
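
The factorization described above can be illustrated with a minimal sketch. The abstract does not specify how the blank and vocabulary scores are combined, so everything here is an assumption made for illustration: the function names `log_softmax` and `factorized_joint`, the choice to add the language model's log-probabilities to per-token acoustic scores, and the convention that index 0 is the blank symbol. It should not be read as the authors' implementation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def factorized_joint(blank_score, acoustic_vocab_scores, lm_vocab_scores):
    """Hypothetical joint step: blank and vocabulary are scored separately.

    blank_score           -- scalar score for emitting blank (assumed input)
    acoustic_vocab_scores -- per-token scores from the acoustic side (assumed)
    lm_vocab_scores       -- per-token logits from a standalone language model

    Returns log-probabilities over [blank] + vocabulary (index 0 is blank).
    """
    # The vocabulary branch behaves like an ordinary language model:
    # it yields a normalized distribution over tokens, so it can be
    # trained or adapted on text alone.
    lm_log_probs = log_softmax(lm_vocab_scores)

    # Assumption: acoustic evidence and LM log-probabilities are summed.
    vocab_scores = [a + l for a, l in zip(acoustic_vocab_scores, lm_log_probs)]

    # Blank is scored on its own path and joined only at the final softmax.
    return log_softmax([blank_score] + vocab_scores)


# Usage: swapping in adapted LM logits changes the token distribution
# without touching the blank path or the acoustic scores.
general = factorized_joint(0.0, [0.0, 0.0], [0.0, 0.0])
adapted = factorized_joint(0.0, [0.0, 0.0], [2.0, 0.0])
```

Because the vocabulary branch is a self-contained language model in this sketch, replacing `lm_vocab_scores` with the output of a text-adapted model is the only change needed at inference time, which is the property the abstract attributes to the factorization.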
