Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Abstract

Transformer networks have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, Transformer-XL, that enables Transformer to learn dependencies beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependencies, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependencies that are about 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformer during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
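To make the idea of segment-level recurrence and context fragmentation more concrete, the following is a minimal sketch only, not the released implementation. It assumes that the recurrence amounts to caching a fixed amount of state from earlier segments and exposing it as extra context to the current segment; raw token ids stand in for the cached hidden states, and the names segments_with_memory, visible_context, seg_len, and mem_len are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def segments_with_memory(
    tokens: Sequence[int], seg_len: int, mem_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (memory, segment) pairs for consecutive fixed-length segments.

    `memory` stands in for cached hidden states of earlier segments. Here it
    simply holds the last `mem_len` tokens already processed, which is an
    illustrative simplification rather than the paper's actual mechanism.
    """
    if seg_len <= 0 or mem_len < 0:
        raise ValueError("seg_len must be positive and mem_len non-negative")
    memory: List[int] = []
    for start in range(0, len(tokens), seg_len):
        segment = list(tokens[start:start + seg_len])
        yield memory, segment
        # Reuse the most recent context for the next segment instead of
        # recomputing it; a new list is built, so yielded values stay intact.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []


def visible_context(memory: List[int], segment: List[int], i: int) -> List[int]:
    """Items visible to position i of the segment under causal attention."""
    return memory + segment[: i + 1]


if __name__ == "__main__":
    tokens = list(range(8))
    for mem_len in (0, 4):
        pairs = list(segments_with_memory(tokens, seg_len=4, mem_len=mem_len))
        memory, segment = pairs[1]
        print(mem_len, visible_context(memory, segment, 0))
```

With mem_len set to 0, the first position of the second segment sees only itself, which is the context fragmentation the abstract refers to. With a non-zero mem_len, the same position also sees the cached context from the preceding segment.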
