An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Abstract

Utilizing text-only data with an external language model (ELM) in the end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) has been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the implicitly learned internal language model (ILM) prior should first be subtracted from the RNN-T posterior before the ELM is integrated. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of the ILM and may deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) that replaces the estimation with a low-order weak language model. Extensive empirical experiments are conducted in both in-domain and cross-domain scenarios on the English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
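
The score combination described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `fused_score`, its parameters, and the weight values are all hypothetical, and the log-probabilities are assumed to be precomputed per hypothesis. Shallow fusion only adds the weighted ELM score, whereas DR-style methods also subtract a weighted ILM estimate. In this sketch DR and LODR differ only in which model supplies `log_p_ilm` (a full-context neural language model for DR, a low-order one for LODR).

```python
from typing import Optional


def fused_score(
    log_p_rnnt: float,
    log_p_elm: float,
    elm_weight: float,
    log_p_ilm: Optional[float] = None,
    ilm_weight: float = 0.0,
) -> float:
    """Combine log scores for one hypothesis (hypothetical helper).

    log_p_rnnt: log posterior from the RNN-T.
    log_p_elm:  log probability from the external language model.
    log_p_ilm:  log probability from the ILM estimate; None means
                plain shallow fusion with no ILM subtraction.
    The weights are illustrative tuning parameters, not values from the paper.
    """
    # Shallow fusion: add the weighted external LM score.
    score = log_p_rnnt + elm_weight * log_p_elm
    # DR / LODR style: additionally subtract the weighted ILM estimate.
    if log_p_ilm is not None:
        score -= ilm_weight * log_p_ilm
    return score


# Shallow fusion only.
sf = fused_score(log_p_rnnt=-10.0, log_p_elm=-20.0, elm_weight=0.5)

# With ILM subtraction, e.g. using a low-order LM score as the ILM estimate.
lodr = fused_score(
    log_p_rnnt=-10.0,
    log_p_elm=-20.0,
    elm_weight=0.5,
    log_p_ilm=-16.0,
    ilm_weight=0.25,
)
```

In a beam search, a score of this kind would typically be used to rank competing hypotheses, with the weights tuned on a development set.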
