Streaming automatic speech recognition with the transformer model

Abstract

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, the practical usage is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.7% and 7.0% WER for the "clean" and "other" test data of LibriSpeech, which to the best of our knowledge is the best published streaming end-to-end ASR result for this task.
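
As a rough illustration of the time-restricted self-attention mentioned in the abstract, the sketch below builds a boolean attention mask in which each encoder frame may attend only to a limited window of past and future frames. It is not the authors' implementation: the function name `time_restricted_mask` and the parameters `left_context` and `right_context` are hypothetical, and the window sizes used are arbitrary, since the abstract does not state the paper's settings.

```python
def time_restricted_mask(num_frames, left_context, right_context):
    """Return mask[t][s] == True when frame t may attend to frame s.

    Hypothetical helper for illustration only; the window sizes are
    assumptions, not values taken from the paper.
    """
    mask = []
    for t in range(num_frames):
        # Clamp the attention window to the valid frame range.
        lo = max(0, t - left_context)
        hi = min(num_frames - 1, t + right_context)
        mask.append([lo <= s <= hi for s in range(num_frames)])
    return mask


# Example: 5 frames, 2 frames of history, 1 frame of look-ahead.
mask = time_restricted_mask(5, left_context=2, right_context=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Limiting `right_context` bounds how many future frames a single self-attention layer can see, which is what keeps the encoder from needing the whole utterance before it can produce output.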
