Abstract
Transformer networks have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, Transformer-XL, that enables the Transformer to learn dependencies beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependencies, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependencies that are about 80% longer than those of RNNs and 450% longer than those of vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.