End-To-End Speech Recognition Using A High Rank LSTM-CTC Based Model

Abstract

Long Short-Term Memory Connectionist Temporal Classification (LSTM-CTC) based end-to-end models are widely used in speech recognition due to their simplicity in training and efficiency in decoding. In conventional LSTM-CTC based models, a bottleneck projection matrix maps the hidden feature vectors obtained from the LSTM to the softmax output layer. In this paper, we propose to replace the projection matrix with a high rank projection layer. The output of the high rank projection layer is a weighted combination of vectors that are projected from the hidden feature vectors via different projection matrices and a non-linear activation function. The high rank projection layer improves the expressiveness of LSTM-CTC models. Experimental results show that on the Wall Street Journal (WSJ) corpus and the LibriSpeech data set, the proposed method achieves a 4%-6% relative word error rate (WER) reduction over the baseline CTC system. The resulting models outperform other published CTC based end-to-end (E2E) models under the condition that no external data or data augmentation is applied. Code has been made available at https://github.com/mobvoi/lstm_ctc.
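
To make the idea of a "weighted combination of projected vectors" concrete, the following is a minimal sketch for a single frame, not the implementation from the repository above. It rests on several assumptions that the abstract does not state: the non-linearity is taken to be tanh, the combination weights are taken to come from a softmax over a linear function of the hidden vector, and all names (`high_rank_projection`, `proj_mats`, `gate_mat`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def high_rank_projection(h, proj_mats, gate_mat):
    """Combine K non-linear projections of one hidden vector h.

    h         : hidden feature vector of length H (e.g. one LSTM output frame)
    proj_mats : K projection matrices, each D x H (assumed shapes)
    gate_mat  : K x H matrix producing the combination weights (assumption)
    Returns a vector of length D that would feed the softmax output layer.
    """
    # Assumption: the weights are a softmax over a linear function of h.
    weights = softmax(matvec(gate_mat, h))
    # Assumption: tanh is used as the non-linear activation of each branch.
    branches = [[math.tanh(v) for v in matvec(p, h)] for p in proj_mats]
    dim = len(branches[0])
    return [sum(w * b[d] for w, b in zip(weights, branches)) for d in range(dim)]


def rand_matrix(rows, cols, scale=0.1):
    """Small random matrix used only to exercise the sketch."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
H, D, K = 8, 4, 3  # hidden size, projection size, number of branches
h = [random.gauss(0.0, 1.0) for _ in range(H)]
proj_mats = [rand_matrix(D, H) for _ in range(K)]
gate_mat = rand_matrix(K, H)

z = high_rank_projection(h, proj_mats, gate_mat)
print(len(z))
```

With a single branch (K = 1) the weight is 1 and the sketch reduces to one projection followed by tanh. With several branches, each frame mixes the projections differently.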
