End-to-end Speech Recognition with Word-based RNN Language Models

Abstract

This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR). In our prior work, we proposed a multi-level LM, in which character-based and word-based RNN-LMs are combined in hybrid CTC/attention-based ASR. Although this multi-level approach achieves significant error reduction in the Wall Street Journal (WSJ) task, two different LMs need to be trained and used for decoding, which increases the computational cost and memory usage. In this paper, we further propose a novel word-based RNN-LM, which allows us to decode with only the word-based LM: it provides look-ahead word probabilities to predict the next characters in place of the character-based LM, leading to competitive accuracy with less computation than the multi-level LM. We demonstrate the efficacy of the word-based RNN-LMs using a larger corpus, LibriSpeech, in addition to the WSJ corpus used in our prior work. Furthermore, we show that the proposed model achieves 5.1% WER on the WSJ Eval'92 test set when the vocabulary size is increased, which is the best WER reported for end-to-end ASR systems on this benchmark.
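
The look-ahead idea can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it takes a fixed word distribution (in a real system this would come from the word-based RNN-LM, conditioned on the word history) and derives next-character probabilities for a partial word by summing the probability of all words that continue the prefix. The function name `next_char_probs`, the end-of-word marker `</w>`, and the toy vocabulary are all hypothetical.

```python
def next_char_probs(word_probs, prefix, eow="</w>"):
    """Derive next-character probabilities from a word-level distribution.

    word_probs: dict mapping each vocabulary word to its probability
                (assumed here to be a fixed toy distribution).
    prefix:     characters of the current, partially decoded word.
    eow:        hypothetical marker meaning "the word ends here".
    """
    # Keep only the words that are still reachable from this prefix.
    matching = {w: p for w, p in word_probs.items() if w.startswith(prefix)}
    total = sum(matching.values())
    if total == 0.0:
        # No vocabulary word starts with this prefix.
        return {}

    probs = {}
    for word, p in matching.items():
        # A word equal to the prefix contributes to the end-of-word event;
        # any longer word contributes to its next character.
        key = eow if word == prefix else word[len(prefix)]
        probs[key] = probs.get(key, 0.0) + p / total
    return probs


# Toy word distribution, for illustration only.
toy_word_probs = {"the": 0.5, "then": 0.2, "this": 0.2, "a": 0.1}
after_th = next_char_probs(toy_word_probs, "th")
after_the = next_char_probs(toy_word_probs, "the")
```

With the prefix "th", the mass of "the" and "then" is assigned to the character "e" and the mass of "this" to "i", each normalized by the total mass of words starting with "th". A practical decoder would precompute these sums over a prefix tree of the vocabulary rather than scanning every word at each step.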
