Abstract
Transformer networks have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, \textit{Transformer-XL}, that enables Transformers to learn dependencies beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependencies, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependencies that are about 80\% longer than RNNs and 450\% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.