Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Abstract

Transformer networks have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, Transformer-XL, that enables Transformer to learn dependency beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependency that is about 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformer during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.
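
The abstract names a segment-level recurrence mechanism without spelling it out, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes a single attention layer with one head, and it deliberately omits the positional encoding scheme, multiple layers, and gradient handling. The function names (attend_with_memory, run_segments) and the parameters seg_len and mem_len are hypothetical, chosen for this example.

```python
import numpy as np


def attend_with_memory(segment, memory, w_q, w_k, w_v):
    """Single-head causal attention where keys/values span [memory; segment].

    segment: (seg_len, d) inputs for the current segment
    memory:  (mem_len, d) cached inputs from earlier segments (may be empty)
    """
    seg_len, mem_len = segment.shape[0], memory.shape[0]
    context = np.concatenate([memory, segment], axis=0)

    q = segment @ w_q   # queries come from the current segment only
    k = context @ w_k   # keys and values also cover the cached memory
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Query i may see every memory position and segment positions <= i.
    future = np.triu(np.ones((seg_len, seg_len), dtype=bool), k=1)
    mask = np.concatenate(
        [np.zeros((seg_len, mem_len), dtype=bool), future], axis=1
    )
    scores = np.where(mask, -np.inf, scores)

    # Numerically stable softmax over the context axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def run_segments(tokens, seg_len, mem_len, w_q, w_k, w_v):
    """Process a long sequence segment by segment, reusing a bounded cache."""
    memory = np.zeros((0, tokens.shape[1]))
    outputs = []
    for start in range(0, tokens.shape[0], seg_len):
        segment = tokens[start:start + seg_len]
        outputs.append(attend_with_memory(segment, memory, w_q, w_k, w_v))
        # Keep only the most recent mem_len rows. In a trained model the
        # cache would be treated as a constant (no gradient flows into it);
        # that detail is an assumption here and is not modeled in NumPy.
        combined = np.concatenate([memory, segment], axis=0)
        memory = combined[max(0, combined.shape[0] - mem_len):]
    return np.concatenate(outputs, axis=0)
```

With mem_len set to 0, each segment is processed in isolation, which is the context fragmentation the abstract refers to. In this simplified one-layer setting, a cache large enough to hold all earlier inputs makes the segmented computation match causal attention over the full sequence.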
