Abstract
Recent developments in neural models have connected the encoder and decoder through a self-attention mechanism. In particular, the Transformer, which is based solely on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multi-head attention mechanism, a key component of the Transformer, limits the effective deployment of the model in resource-limited settings. In this paper, based on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-billion) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention not only substantially compresses the model parameters but also improves performance compared with a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor train decomposition.