Language Modeling with Deep Transformers

Abstract

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. Positional encoding is an essential augmentation for the self-attention mechanism, which is invariant to sequence ordering. However, in an autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal in its own right. The analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models.
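The positional argument in the abstract can be illustrated with a small sketch. It is not the paper's model: it assumes a single attention head, one layer, random weights, and no positional encoding, and every name in it (`self_attention`, `rand_matrix`, the toy vectors `a`, `b`, `c`) is hypothetical. It compares the output for the same token `b` when it sits at position 1 in `[a, b, c]` and at position 0 in `[b, a, c]`.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(xs, Wq, Wk, Wv, causal):
    """Single-head self-attention with NO positional encoding (toy sketch)."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    d = len(qs[0])
    outputs = []
    for t, q in enumerate(qs):
        # Causal (autoregressive): position t only sees positions 0..t.
        visible = range(t + 1) if causal else range(len(xs))
        weights = softmax([dot(q, ks[j]) / math.sqrt(d) for j in visible])
        outputs.append(
            [sum(w * vs[j][i] for w, j in zip(weights, visible)) for i in range(d)]
        )
    return outputs


def close(u, v, tol=1e-9):
    return all(abs(x - y) <= tol for x, y in zip(u, v))


rng = random.Random(0)
dim = 4


def rand_matrix():
    # Small random weights; purely illustrative, not trained parameters.
    return [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
a, b, c = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

# Output for token b at position 1 of [a, b, c] vs. position 0 of [b, a, c].
full_same = close(
    self_attention([a, b, c], Wq, Wk, Wv, causal=False)[1],
    self_attention([b, a, c], Wq, Wk, Wv, causal=False)[0],
)
causal_same = close(
    self_attention([a, b, c], Wq, Wk, Wv, causal=True)[1],
    self_attention([b, a, c], Wq, Wk, Wv, causal=True)[0],
)

print("unmasked attention ignores position of b:", full_same)
print("causal attention ignores position of b:", causal_same)
```

Under these assumptions, unmasked attention gives `b` the same output in both orderings, because it attends over the same set of tokens either way. With the causal mask, `b` sees only itself at position 0 but sees `a` and `b` at position 1, so its output differs. The visible context grows with position, which is the implicit positional signal the abstract describes.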
