A Length-Extrapolatable Transformer

Abstract

Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then we propose two designs to improve the above metric of Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants with language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
