Abstract
Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous work uses linear interpolation or a fusion network to integrate external language models. However, these approaches introduce external components and increase decoding computation. In this paper, we instead propose a knowledge distillation based training approach to integrating external language models into a sequence-to-sequence model. A recurrent neural network language model, trained on large-scale external text, generates soft labels to guide the training of the sequence-to-sequence model; the language model thus plays the role of the teacher. This approach does not add any external component to the sequence-to-sequence model during testing, and it can still be combined with shallow fusion for decoding. Experiments are conducted on the public Chinese datasets AISHELL-1 and CLMAD. Our approach achieves a character error rate of 9.3%, a relative reduction of 18.42% compared with the vanilla sequence-to-sequence model.
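The idea of training against teacher soft labels can be illustrated with a minimal per-token sketch. The abstract does not give the exact loss, so everything here is an assumption for illustration: the interpolation weight `lam`, the function names `softmax` and `distillation_loss`, and the choice to mix a hard-label cross-entropy with a cross-entropy against the language model's output distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs, target_index, lam=0.5):
    """Per-token loss mixing hard and soft labels (hypothetical form).

    student_logits: seq2seq decoder logits for one output step.
    teacher_probs:  language model distribution for the same step (soft labels).
    target_index:   index of the ground-truth character (hard label).
    lam:            assumed interpolation weight between the two terms.
    """
    log_probs = [math.log(p) for p in softmax(student_logits)]
    # Cross-entropy against the one-hot transcript label.
    hard_loss = -log_probs[target_index]
    # Cross-entropy against the teacher's soft labels.
    soft_loss = -sum(t * lp for t, lp in zip(teacher_probs, log_probs))
    return (1.0 - lam) * hard_loss + lam * soft_loss

# Example: a three-character vocabulary at one decoding step.
loss = distillation_loss([2.0, 0.5, -1.0], [0.7, 0.2, 0.1], target_index=0, lam=0.5)
```

In this sketch the teacher is only consulted while computing the training loss, which is consistent with the claim that nothing is added to the model at test time.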