Abstract
End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to capture language information, but its language modeling ability is limited because it still needs paired speech-text data for training. Further strengthening the language modeling ability with extra text data, for example through shallow fusion with an external language model, brings only a small performance gain. Given that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. Because the converter maps syllables to characters, a rich text repository can easily be used to strengthen the language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.