Abstract
Latest development of neural models has connected the encoder and decoderthrough a selfattention mechanism. In particular, Transformer, which is solelybased on selfattention, has led to breakthroughs in Natural LanguageProcessing (NLP) tasks. However, the multihead attention mechanism, as a keycomponent of Transformer, limits the effective deployment of the model to aresourcelimited setting. In this paper, based on the ideas of tensordecomposition and parameters sharing, we propose a novel selfattention model(namely Multilinear attention) with BlockTerm Tensor Decomposition (BTD). Wetest and verify the proposed attention method on three language modeling tasks(i.e., PTB, WikiText103 and Onebillion) and a neural machine translation task(i.e., WMT2016 EnglishGerman). Multilinear attention can not only largelycompress the model parameters but also obtain performance improvements,compared with a number of language modeling approaches, such as Transformer,TransformerXL, and Transformer with tensor train decomposition.
Quick Read (beta)
A Tensorized Transformer for Language Modeling
Abstract
Latest development of neural models has connected the encoder and decoder through a selfattention mechanism. In particular, Transformer, which is solely based on selfattention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multihead attention mechanism, as a key component of Transformer, limits the effective deployment of the model to a resourcelimited setting. In this paper, based on the ideas of tensor decomposition and parameters sharing, we propose a novel selfattention model (namely Multilinear attention) with BlockTerm Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText103 and Onebillion) and a neural machine translation task (i.e., WMT2016 EnglishGerman). Multilinear attention can not only largely compress the model parameters but also obtain performance improvements, compared with a number of language modeling approaches, such as Transformer, TransformerXL, and Transformer with tensor train decomposition.
A Tensorized Transformer for Language Modeling
Xindian Ma^{1}, Peng Zhang^{1}^{†}^{†}thanks: Corresponding Author: Peng Zhang , Shuai Zhang^{1}, Nan Duan^{2}, Yuexian Hou^{1}, Dawei Song^{3}, Ming Zhou^{2} ^{1}College of Intelligence and Computing, Tianjin University, Tianjin, China ^{2}Microsoft Research Asia, Beijing, China ^{3}School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China {xindianma, pzhang, szhang96, yxhou}@tju.edu.cn {nanduan, mingzhou}@microsoft.com {dwsong}@bit.edu.cn
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
In NLP, Neural language model pretraining has shown to be effective for improving many tasks devlin2018bert ; peters2018deep . Transformer vaswani2017attention is based solely on the attention mechanism, and dispensing with recurrent and convolutional networks entirely. At present, this model has received extensive attentions and plays an key role in many neural language models, such as BERT devlin2018bert , GPT radford2018improving and Universal Transformer dehghani2018universal . However, in Transformer based model, a lot of model parameters may cause problems in training and deploying these parameters in a resourcelimited setting. Thus, the compression of large neural pretraining language models has been an essential problem in NLP research.
In literature, there are some compression methods khrulkov2019tensorized ; ye2018learning ; han2015learning proposed. When the vocabulary is large, the corresponding weight matrices can be enormous. Tensorized embedding (TE) khrulkov2019tensorized uses the tensortrain oseledets2011tensor to compress the embedding layers in TransformerXL dai2019transformer , but has not compressed the attention layer. Recently, BlockTerm Tensor Decomposition(BTD) de2008decompositions is used to compress recurrent neural networks (RNNs) ye2018learning . Ye et al. ye2018learning propose a compact flexible structure to deal with the large number of model parameters instead by high dimensional inputs in training recurrent neural networks (RNNs). This method greatly reduces the parameters of RNNs and improves their training efficiency. Still, the model only considers the input layer compression by the idea of lowrank approximation. On the other hand, some methods han2015learning ; buci2006model aim to develop a specific structure on its weight matrices and can reduce the parameters of the models. However, the new structure after compressing can not be integrated into the model vaswani2017attention .
In Transformer, the multihead attention is a key part and it is constructed by a large number of parameters. Specifically, Ashish et.al vaswani2017attention compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$, while the keys and values are also packed together into matrices $K$ and $V$, respectively. The attention function then adopts a nolinear function $softmax$ over two matrices $Q$ and $K$. There are two challenges to find a highquality compression method to compress the multihead attention in Transformer.
First, the selfattention function in Transformer is a nonlinear function, which makes it difficult to compress. In order to address this challenge, we first prove that the output of the attention function of the selfattention model vaswani2017attention can be linearly represented by a group of orthonormal base vectors. Then, by initializing a low rank core tensor, we use Tuckerdecomposition tucker1966some ; li2017bt to reconstruct a new attention representation, where $Q$, $K$ and $V$ can be considered as factor matrices. In order to construct the multihead mechanism and compress the model, we use the method of BlockTerm Tensor Decomposition (BTD), which is a combination of CP decomposition carroll1970analysis and Tucker decomposition tucker1966some . The difference is that three factor matrices $Q,K$ and $V$ are shared in constructing each $3$order block tensor. This process can reduce many parameters.
The second challenge is that the attention model after compressing can not be directly integrated into the encoder and decoder framework of Transformer vaswani2017attention ; dai2019transformer . In order to address this challenge, there are three steps as follows. First, the average of each block tensor can be computed; Second, multiple matrices can be given by tensor split. Third, the concatenation of these matrices can serve as the input to the next layer network in Transformer. After that, it can be integrated into the encoder and decoder framework of Transformer vaswani2017attention ; dai2019transformer and trained endtoend. Moreover, we also prove that the $3$order tensor can reconstruct the scaled dotproduct attention in Transformer by a sum on a particular dimension.
Our method combines two ideas which are the lowrank approximation and parameters sharing at the same time. Therefore, it achieves the higher compression ratios. Although the selfattention (i.e., scaled dotproduct attention) in Transformer can be reconstructed, we do not consider reconstructing it and choose to split the $3$order tensor (the output of Multilinear attention) which is helpful for improving the accuracy in experiments.
Our major contributions of this paper are as follows:

1)
It is proved that the output of scaled dotproduct attention (considering as a function) can be linearly represented by a group of orthonormal base vectors.

2)
A novel selfattention method, namely Multilinear attention, is provided, which combines two compression ideas, parameters sharing and lowrank approximation, together.

3)
Multilinear attention builds the strong connection between three factor matrices (pack a set of queries, keys and values, respectively ), enhancing the ability of capturing sufficient attention information. We also prove our model can reconstruct the scaled dotproduct attention in the original Transformer.
In order to validate the benefits of our model, we test it on two NLP tasks, namely language modeling and neural machine translation. In our experiments, the multihead attention can be replaced by the proposed model, namely multilinear attention. We have observed that the standard Multihead attention can be compressed with higher compression ratios on OneBillion dataset. As a result, we show that multilinear attention not only considerably reduces the number of parameters, but also achieve promising experiments results, especially in language modeling tasks.
2 Preliminaries
Multilinear attention is carried out in this paper. The analysis of Multilinear attention relies on these concepts and results from the field of tensor decomositon and multihead attention. We cover below in Section 2.1 basic background on BlockTerm tensor decomposition de2008decompositions . Then, we describe in Section 2.2 multihead attention vaswani2017attention .
2.1 Tensor and BlockTerm Tensor Decomposition
Tensor We use the Euler script letter $\mathcal{A}$ to denote a tensor which can be thought of as a multiarray. Thereby a vector and a matrix are a $1$order tensor and $2$order tensor, respectively. The element in a $n$order tensor is denoted as ${\mathcal{A}}_{{d}_{1},\mathrm{\dots},{d}_{n}}$. In the geometric representation of a tensor, $3$order tensor can be represented by a cube. After that, there is a related concept named $tensorslice$ that will be used in this paper. Tensor and some other related concepts are showed in Supplementary Materials A.
BlockTerm Tensor Decomposition (BTD) BlockTerm tensor decomposition is a combination of CP decomposition carroll1970analysis and Tucker decomposition tucker1966some . Given a $n$order tensor $\mathcal{A}\in {\mathbb{R}}^{{d}_{1}\times \mathrm{\dots}\times {d}_{n}}$. A highorder tensor can be decomposed into $P$ block terms by the method named BTD. ${\bullet}_{z}$ is denoted as the tenortensor product on the $z$$th$ order kolda2009tensor and $z\in \{1,\mathrm{\dots},d\}$. Each term contains ${\bullet}_{z}$ between a core tensor ${\mathcal{G}}_{i}\in {\mathbb{R}}^{{R}_{1}\times \mathrm{\dots}\times {R}_{d}}$ and $d$ factor matrices ${\mathcal{X}}_{i}^{(k)}\in {\mathbb{R}}^{{d}_{k}\times {R}_{k}}$, where $i\in [1,P]$ and $k\in [1,d]$. The formulation of BTD decomposition is as follows:
$$\mathcal{A}=\sum _{i=1}^{P}{\mathcal{G}}_{i}{\bullet}_{1}{\mathcal{X}}_{i}^{(1)}{\bullet}_{2}{\mathcal{X}}_{i}^{2}{\bullet}_{3}\mathrm{\dots}{\bullet}_{d}{\mathcal{X}}_{i}^{(d)}$$  (1) 
where $P$ is the CP rank, and $d$ is the Coreorder. In our work, the tensor is $3$order. Figure 1 demonstrates the example of how a $3$order tensor $\mathcal{A}$ can be decomposed into $P$ block terms.
2.2 Multihead Attention
In Transformer, the attention function is named as “Scaled DotProduct Attention”. In practice, Transformer vaswani2017attention processes query, keys and values as matrices $Q$, $K$, and $V$ respectively. The attention function can be written as follows:
$$Attention(Q,K,V)=softmax(\frac{Q{K}^{T}}{\sqrt{d}})V$$  (2) 
where $d$ is the number of columns of $Q$ and $K$. In these work vaswani2017attention ; devlin2018bert ; dai2019transformer , they all use the multihead attention, as introduced in vaswani2017attention ,
$MultiHeadAttention(Q,K,V)$  $=Concat(hea{d}_{1},\mathrm{\dots},hea{d}_{k}){W}^{O}$  (3)  
$wherehea{d}_{i}$  $=Attention(Q{W}_{i}^{Q},K{W}_{i}^{K},V{W}_{i}^{V})$ 
where matrices ${W}_{i}^{Q}$ and ${W}_{i}^{K}\in {\mathbb{R}}^{{d}_{model}\times d}$, ${W}_{i}^{V}\in {\mathbb{R}}^{{d}_{model}\times d}$ and ${W}^{O}\in {\mathbb{R}}^{h{d}_{v}\times {d}_{model}}$. In practice, ${d}_{v}$ is equal to $d$. In this work vaswani2017attention , multiple groups of parameters (${W}_{i}^{Q}$, ${W}_{i}^{K}$ and ${W}_{i}^{V}$) are used, which results in a large number of redundant parameters.
3 Tensorized Transformer
In this section, we first build a Singleblock attention in Figure 2 (left) based on the Tucker decomposition, a lowrank decomposition method. In this process, we prove that the selfattention function in Transformer can be represented by a linear function, i.e., a linear combination representation of a set of basic vectors.
In order to compress the multihead mechanism, we propose a multilinear attention constructed by a BlockTerm tensor decomposition. This attention uses the idea of parameters sharing, i.e., sharing factor matrices across multiple blocks, shown in Figure 2 (right). After that, the compression ratios and relatively lower complexity have been analyzed.
3.1 Singleblock Attention by Tucker Decomposition
Before building the Singleblock attention, it is necessary to propose the theorem 3.1. The theorem is closely related to attributes of Singleblock attention function by Tucker decomposition tucker1966some .
Theorem 3.1.
Let ${\mathbf{e}}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\mathbf{e}}_{n}$ be basis vectors from the vector space $S$. Assume that these vectors ${\mathbf{e}}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\mathbf{e}}_{n}$ are linear independent and $Q$,$K$,$V$ can be linearly represented by this set of basis vectors. The output of the attention function in Eq. 2 can be represented by a linear combination of the set of these basis vectors.
$$Attention(Q,K,V)=({\bm{e}}_{1},\mathrm{\dots},{\bm{e}}_{n})M,$$  (4) 
where $M\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}d}$ is a coefficient matrix, and $d$ is a dimension of these matrices (i.e., $Q\mathrm{,}K$, and $V$).
Proof.
The proof can be found in Supplementary Materials B. ∎
In Figure 2 (left), it is a schematic diagram about the Singleblock attention. First, we assume that the query, key and value can be mapped into three factor matrices of which are composed of three groups of orthogonal basis vectors. Three factor matrices are $Q$, $K$ and $V$. After that, we can construct a new attention (i.e., Singleblock attention) by initializing a $3$order diagonal tensor (trainable) which is the $\mathcal{G}$. In Figure 2 (left), $R$ is the rank about the tensor, $N$ is the length of a sequence, and $d$ is the dimension of matrix. The function of Singleblock attention can be computed based on Tuckerdecomposition as follows:
$Atte{n}_{TD}(\mathcal{G};Q,K,V)=$  $\mathcal{G}{\bullet}_{1}Q{\bullet}_{2}K{\bullet}_{3}V$  (5)  
$=$  $\sum _{i=1}^{I}}{\displaystyle \sum _{j=1}^{J}}{\displaystyle \sum _{m=1}^{M}}{\mathcal{G}}_{ijm}{Q}_{i}\circ {K}_{j}\circ {V}_{m$ 
where $\mathcal{G}$ is a core tensor. $i,j$ and $m$ are the indexes of the core tensor. $\circ $ is the outer product. ${\bullet}_{z}$ is the same definition in Eq. 1. ${Q}_{i},{K}_{j}$ and ${V}_{k}$ are column vectors from matrices $Q,K$ and $V$, where $Q\in {\mathbb{R}}^{N\times d}$, $K\in {\mathbb{R}}^{N\times d}$ and $V\in {\mathbb{R}}^{N\times d}$, and $N$ is the length of a sequence. In practice, we set $I$=$J$=$M$=$R$. The core tensor $\mathcal{G}$ can be defined as follows,
$${\mathcal{G}}_{ijm}=\{\begin{array}{cc}rand(0,1)\hfill & \hfill i=j=m\\ 0\hfill & \hfill otherwise\end{array}$$  (6) 
where the $rand(0,1)$ is a random function, and the diagonal entries of core tensor $\mathcal{G}$ form the vector $\bm{g}$. Each entry ${\bm{g}}_{r}\in (0,1)$, $r\in \{1,\mathrm{\dots},R\}$. We can consider $\bm{g}$ as the trainable weight. In experiments, we compute the weight vector by $softmax$ function (i.e., $softmax(\bm{g})$).
After that, the output of Singleblock attention function is a $3$order tensor which is given by linear computation. The Singleblock attention (i.e., a $3$order tensor with Tucker decomposition) can reconstruct the Scaled DotProduct attention in Eq. 2 by the summing over the tensor according to the second index ^{1}^{1} 1 If the coordinates of a $3$order tensor are $i,j$ and $m$, $j$ is the second index. (it can be seen as the coordinates in the vertical direction for a tensor), as proved in the following corollary. Note that in our model, we do not adopt the above reconstructing process. Instead, to obtain a new representation, we adopt the concat method after the tensor splitting (see Sec. 3.2). We will further show the compression ability of the Singleblock attention in Sec. 3.3.
Corollary 1.
Under the same conditions as in Theorem 3.1 and the value of $N$ is equal to the value of $d$, Singleblock attention representation Eq. 5 can reconstruct the Scaled DotProduct attention in Eq. 2 by the summing over the tensor (i.e., the output of Singleblock attention function) according to the second index. It holds that:
$$Attention{(Q,K,V)}_{i,m}=\sum _{j=1}^{N}Atte{n}_{TD}{(\mathcal{G};Q,K,V)}_{i,j,m}$$  (7) 
where $i$, $j$ and $m$ are the indices of the Singleblock attention’s output (i.e., a 3order tensor). $A\mathit{}t\mathit{}t\mathit{}e\mathit{}{n}_{T\mathit{}D}\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}$ is the function of Singleblock attention based on Tucker decomposition. $i$ and $m$ are the indices of outputs (i.e., a matrix) from Eq. 2.
Proof.
The proof can be found in Supplementary Materials C. ∎
3.2 MultiLinear Attention by BlockTerm Tensor Decomposition
In order to construct the multihead mechanism and compress the parameters of multiple groups of mapping, we use a group of linear projections, and share the output from the linear projections. In Figure 2(right), the learned linear projection can map queries, keys and values to three matrices which are composed of basis vectors. After that, we use the BlockTerm tensor decomposition to build multihead mechanism. In our work, our model is named as Multilinear attention, which can be formulated as follows:
$MultiLinear(\mathcal{G};Q,K,V)$  $=SplitConcat({\displaystyle \frac{1}{h}}*({T}_{1}+\mathrm{\dots}+{T}_{h})){W}^{O}$  (8)  
$where\mathit{\hspace{1em}}{T}_{j}$  $=Atte{n}_{TD}({\mathcal{G}}_{j};Q{W}^{q},K{W}^{k},V{W}^{v})$ 
where the core tensor ${\mathcal{G}}_{j}$ is a diagonal tensor, and the number of parameter in ${\mathcal{G}}_{j}$ is equal to the rank of core tensor, $j\in \{1,\mathrm{\dots},h\}$. $\mathcal{G}$ is the set of the core tensors. $SplitConcat(\cdot )$ is a function which achieves the concatenation after splitting for a $3$order tensor. Figure 2 (right) shows the basis idea about the multilinear attention. The ${W}^{O}$ is the parameter matrix which is a full connection layer and correlated to the output of Multilinear attention. $Atte{n}_{TD}(\cdot )$ is the function of Singleblock attention, which is a part of Multilinear attention. ${W}^{q}$, ${W}^{K}$ and ${W}^{v}$ are the parameters matrices which are shared in constructing Multilinear attention.
The Multilinear attention is a compression model. After compressing the multihead attention in Transformer, it is to achieve a Tensorized Transformer. The Multilinear attention can be incorporated into Transformer architecture. A diagram which is about the incorporating of Multilinear attention in partial Transformer structure is given in Supplementary Materials E.1.
3.3 Analysis of Compression and Complexity
Compression Our focus is on the compression of the multihead mechanism in the multihead attention of Transformer. Previous work vaswani2017attention gets the multihead attention by multiple groups of linear mappings. We use three linear mappings for matrices $Q$, $K$ and $V$, respectively. For the output of three mappings, we choose to share them which are considered as three factor matrices in reconstructing the Multilinear attention. This process is shown in Figure 2 (left). $h$ is the number of heads in vaswani2017attention , and $d$ is the dimension of factor matrices. The compression ratios can be computed by $(3\times h\times d)/(3\times d+h)$. In practice, $h$ is normally set to $8$, $d$ is set to $512$. In this case, the compression ratios can achieve $8$. In other words, we can reduce almost $8$ times parameters in the attention layer. The details of the computing of compression ratios can be found in Supplementary Materials D. The Transformer also contains other network layers, such as Positionwise feed forward network and embedding layers et al. Therefore, for the compression ratios in whole Transformer, we can compare it by the analysis of experimental results for model parameters.
Complexity The time complexity of the attention function in Eq. 2 is $\mathcal{O}({N}^{2}d)$, $N$ is the length of a sequence, and $d$ is the representation dimension. In Multilinear attention, we can reorder the computations to receive the model complexity $\mathcal{O}({N}^{3})$, where $N$ is also the length of the sequence. The minimum number of sequential operations in Multilinear attention for different layers is approximately equal to the selfattention in Transformer vaswani2017attention .
4 Related Work
The field of language modeling has witnessed many significant advances. Different from the architectures of convolutional neural network (CNNs) and recurrent neural networks (RNNs) language modeling, the Transformer vaswani2017attention and its variants dai2019transformer ; devlin2018bert ; dehghani2018universal achieve excellent results in language modeling processing. Transformer networks have a potential of learning longterm dependency, but are limited by a fixedlength context in the setting of language modeling. Vaswani et al. vaswani2017attention uses a segmentlevel recurrence mechanism and a novel positional encoding scheme to resolve this question. BERT devlin2018bert is a kind of bidirectional encoder representations from transformers. It is designed to pretrain deep bidirectional representation and obtains new SoTA on some NLP tasks. Although these methods have achieved great results, a large number of parameters make it difficult for the model to be trained in limited resources. Transformer fails to generalize in many simple tasks, e.g. copying string and logical inference dehghani2018universal . Universal Transformers dehghani2018universal propose a selfattentive recurrent sequence model which addresses this problem. This methods can increase the training speed. In their work, authors following weight sharing found in CNNs and RNNs, extend the Transformer with a simple form of weight sharing that strikes an effective balance between induces and model expressivity. This methods also uses a large number of parameters.
Therefore, it is very important to consider how to reduce the amount of memory and computing they need. As we know, existing model compression methods are mainly divided into parameter pruning and sharing han2015learning , low rank approximation sainath2013low , knowledge transfer buci2006model , and transferred convolutional filters cohen2016group . Currently, tensor decomposition methods are used to decompose a highorder tensor, which can get different neural network language model structures zhang2019generalized ; zhang2018quantum . Besides, tensor decomposition methods which adopts the idea of low rank approximation in most cases, have been successfully applied to neural networks compression. For example, in literature denton2014exploiting ; jaderberg2014speeding , researchers approximate a tensor by minimizing the reconstruction error of the original parameters on convolutional neural networks (CNNs). However, these approaches tend to accumulate errors when multiple layers are compressed sequentially, and the output feature maps deviate far from the original values with the increase of compressed layers. Our compression method uses the idea of parameters sharing in the constructing of attention layers, and the size of output is same as the output from selfattention in Transformer which can effectively avoid these problems. Tensorizing Neural Networks novikov2015tensorizing have combined the idea of reshaping weights of fullyconnected layers into highdimensional tensors and representing them in Tensor Train format oseledets2011tensor . This approach was later extended to convolutional garipov2016ultimate and recurrent neural networks yang2017tensor . Recently, in these work chen2018groupreduce ; variani2018west , researchers introduce efficient compression methods for the embedding and $softmax$ layers based on structured low rank matrix approximation. TTembedding khrulkov2019tensorized aims to compression the larger embedding layer on TransformerXL dai2019transformer . Sparse Transformer Rewon2019sparse adopts sparse techniques on the attention matrix and reduces its parameters. This work uses a sparse attention matrix by selecting the information on some positions in the attention matrix, but does not change the mechanism of the attention. Our method is different from these works, and combines two compression idea (low rank approximate and parameters sharing) to construct a tensorized Transformer.
In our work, we focus on the compression the multihead attention in Transformer based the idea of parameters sharing. At the same time, we also combine lowrank approximate method to reduce parameters and computation complexity.
5 Experiments
Transformer is a versatile and powerful modeling tool and widely is used in various natural language process tasks. In order to verify the effectiveness of our method (i.e., Multilinear attention) replacing multihead attention in Transformer, we carry out two NLP tasks named language modeling (LM) and neural machine translation (NMT). Code^{2}^{2} 2 https://github.com/szhangtju/ThecompressionofTransformer for running experiments has been released, and the key code which is about our method can be found in Supplementary Materials F.
5.1 Language Modeling
Language modeling is the task of predicting the next word in a sentence. This task is to estimate the joint probability $p(s)$ of a sentence of tokens $s$=$({w}_{1},\mathrm{\dots},{w}_{n})$. The resulting models can be used to generate text or further finetuned to solve other NLP tasks radford2018improving . In this paper, we employ the standard setting of predicting next token given the sequence of preceding tokens, based on the function $p(s)=p({w}_{1}){\prod}_{i=2}^{n}p({w}_{i}{w}_{1},\mathrm{\dots},{w}_{i1})$. We chose three datasets in the order of small (i.e., PTB), medium (i.e., WikiText103) and large (i.e., OneBillion). Models are evaluated based on Perplexity (PPL), which is the average perword logprobability. The lower the PPL, the better the model is.
Specially, we take Transformer, the open source stateofthe art language modeling architecture, and replace the standard multihead attention layers with our Multilinear attention. Then, we test different model configurations on the PTB mikolov2011empirical , WikiText103 merity2016pointer and OneBillion Word benchmark chelba2013one datasets and report the results in Table 1 and Table 2.
Model  Params  Test PPL 
RNN1024+9 Gram chelba2013one  20B  51.3 
LSTM2018512 jozefowicz2016exploring  0.83B  43.7 
GCNN14 bottleneck dauphin2017language  –  31.9 
LSTM81921024+CNN Input jozefowicz2016exploring  1.04B  30.0 
HighBudget MoE shazeer2017outrageously  5B  28.0 
LSTM+Mos yang2017breaking  113M  37.10 
Transformer+adaptive input baevski2018adaptive  0.46B  23.7 
TransformerXL Base dai2019transformer  0.46B  23.5 
TransformerXL Large dai2019transformer  0.8B  21.8 
Tensorized Transformer core$1$  0.16B  20.5 
Tensorized Transformer core$2$  0.16B  19.5 
Model  PTB  WikiText103  
Params  Val PPL  Test PPL  Params  Val PPL  Test PPL  
LSTM+augmented loss inan2016tying  24M  75.7  48.7  –  –  48.7 
Variational RHN zoph2016neural  23M  67.9  65.4  –  –  45.2 
4layer QRNN merity2018analysis  –  –  –  151M  –  33.0 
AWDLSTMMoS yang2017breaking  22M  58.08  55.97  –  29.0  29.2 
Transformer+adaptive input baevski2018adaptive  24M  59.1  57  247M  19.8  20.5 
TransformerXLBase dai2019transformer  24M  56.72  54.52  151M  23.1  24.0 
TransformerXLLarge dai2019transformer  –  –  –  257M  –  18.3 
TransformerXL+TT khrulkov2019tensorized  18 M  57.9*  55.4*  130M  23.61*  25.70* 
Sparse Transformer Rewon2019sparse  14M  74.0*  73.1*  174M  38.98*  40.23* 
Tensorized Transformer core$1$  12M  60.5  57.9  85.3M  22.7  20.9 
Tensorized Transformer core$2$  12M  54.25  49.8  85.3M  19.7  18.9 
5.2 Results and Details
PTB has $929k$ training tokens, $73k$ validation words, and $82k$ test words. The results is reported in Table 2. Similar to AWDLSTMMoS yang2017breaking , we apply variational dropout and weight average to our model (i.e., Tensorized Transformer). In addition, we need to state that, our model only replaces the multihead attention using Multilinear attention structure, and the other structures remain the same. We compare the results of our model with other models. Our model achieves the comparable results with SoTA when the number of core tensor is equal to two. However, our model size (i.e, model parameters) reduces by nearly half comparing with Transformer and TransformerXL.
WikiText103 contains 267,735 unique tokens. The dataset is available wordlevel language modeling benchmark with longterm dependency. It contains 103M training tokens from $28k$ articles, with an average length of 3.6k tokens per article, which allows testing the ability of longterm dependency modeling. As shown in Table 2, our model get the perplexity of $18.9$, which is a comparable experimental result with the previous SoTA perplexity $18.3$ , which demonstrates the effectiveness of the proposed attention architecture.
The OneBillion Word benchmark is a large dataset derived from a news site. The dataset consists of $829,250,940$ tokens over a vocabulary of $793,471$ words. In this dataset, sentences are shuffled and hence the context is limited. Consequently, this dataset mainly tests the ability of modeling only shortterm dependency. The comparison between Tensorized Transformer and the other methods are shown in Table 1. Although Tensorized Transformer is mainly designed to better compress Transformer or TransformerXL model, it dramatically improves the singlemodel SoTA from 21.8 to 19.5. Specifically, Tensorized Transformer significantly outperforms a contemporary method using vanilla Transformers vaswani2017attention , suggesting that the advantage of the tensorized Transformer is also generalizable to modeling short sequences.
Table 2 and Table 1 show that our model get the lower PPL than other models in three datasets. An exciting observation is that our model has much fewer parameters. The model of TransformerXL+TT khrulkov2019tensorized is a recent compression model with Tensor Train to compress the input embedding layers only. Sparse Transformer Rewon2019sparse uses the method of sparse attention matrix to compress Transformer model. The results in Table 2 show that compared with TransformerXL+TT, our method has much fewer parameters, and better language modeling performance. These results verify that our model (i.e., Multilinear attention) is effective in language modeling tasks, and has performed well for the model compression. Other details (such as hyperparameters and Hardware) can be found in Supplementary Materials E.
5.3 Neural Machine Translation
The goal is to map an input sequence $s=({x}_{1},{x}_{2},\mathrm{\dots},{x}_{n})$ representing a phrase in one language, to an output sequence $y=({y}_{1},{y}_{2},\mathrm{\dots},{y}_{m})$ representing the same phrase in a different language. In this task, we have trained the Transformer model vaswani2017attention on WMT 2016 EnglishGerman dataset sennrich2016edinburgh . Sentences were tokenized using the SentencePiece ^{3}^{3} 3 https://github.com/google/sentencepiece. For our experiments, we have replaced each of the attention layers with Multilinear attention in Encoder. For evaluation we used beam search with a beam size of 5 and length penalty $\alpha $=$0.6$. In this section, we only compared the results with Transformer vaswani2017attention . Our results are summarized in Table 3. $*$ indicates that the result is our own implementation.
In Table 3, we select two baseline models. The Baseline sennrich2016edinburgh is first model in WMT 2016 EnglishGerman dataset. For the other baseline, we use the basic Transformer architecture vaswani2017attention . The BLEU score is $34.5$ for the basic architecture. We carry out two Tensorized Transformer structures, namely core1 and core2 respectively. When Tensorized Transformer core1 and core2 are used, the BLEU scores are $34.10$ and $34.91$, which achieves better performance over Transformer. As for the reported model parameter size, our model uses less parameters.
Model  Params  BLEU 
Baseline sennrich2016edinburgh  –  26.8 
Linguistic Input Featurec sennrich2016linguistic  –  28.4 
Attentional encoderdecoder + BPE sennrich2016edinburgh  –  34.2 
Transformer vaswani2017attention  52M  34.5* 
Tensorized Transformer core$1$  21M  34.10 
Tensorized Transformer core$2$  21.2M  34.91 
5.4 Discussion
We have shown the results on language modeling and neural machine translation tasks using the Multilinear attention. For the compression of the model parameters, although we report the parameters of the whole model structure, our method mainly considers the compression of multihead attention but has not changed other layers in Transformer. Regarding the rationale for the improvements, in Corollary 1, we prove that the output of the original attention can be represented by summing over the 3order tensor. In Figure 2, we use a concat function over these matrices from tensor splitting. The operation of concat can model all values in the 3order tensor, and thus captures more information than sum operator. Another reason could be the alleviation of overfitting by reducing parameters. The overfitting will appear when the number of the core tensor is greater than 2. Besides, according to our experiments, relatively large dimensions of the word embedding can lead to overfitting, resulting in performance degradation. Therefore, our model requires a relatively small dimension of the embedding, compared with the original Transformer. In order for a more systematic evaluation, we report more experiments and analyses in Supplementary Materials E.4.
6 Conclusion and Further Work
We have proposed a novel self attention encoder layer, namely the Multilinear attention, to compress the original multihead attention and derive a novel encoding scheme. Our main contribution lies in a structure of Tensorized Transformer based on BlockTerm tensor decomposition which is represented by the combination of a group of $3$order tensors, with lowrank approximation and parameters sharing ideas adopted. Compared with existing Transformer based methods, our model achieved higher compression ratio and got better experimental results, particularly in language modeling task. These evidences imply that our method can potentially be further applied to more NLP tasks with limited resources.
In the future, we will continue to optimize the Tensorized Transformer framework and apply it in other NLP tasks. As we stated earlier, our model may suffer from overfitting when the number of cores is large in language modeling. In the future, we will explore the fundamental reasons that cause the problem and tackle them within the Tensorized Transformer framework.
7 Acknowledgement
This work is supported in part by the state key development program of China (grant No. 2017YFE0111900, 2018YFC0831704), Natural Science Foundation of China (grant No. 61772363, U1636203), and the European Unions Horizon 2020 research and innovation programme under the Marie SkodowskaCurie grant agreement No.721321.
References
 [1] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. A quantum manybody wave function inspired language modeling approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1303–1312. ACM, 2018.
 [2] Alec Radford Rewon Child, Scott Gray and Ilya Sutskever. Generating long sequences with sparse transformer. arXiv preprint arXiv:1904.10509, 2019.
 [3] Lipeng Zhang, Peng Zhang, Xindian Ma, Shuqin Gu, Zhan Su, and Dawei Song. A generalized language model in tensor space. arXiv preprint arXiv:1901.11167, 2019.
 [4] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.
 [5] Cristian Bucilu, Rich Caruana, and Alexandru NiculescuMizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM, 2006.
 [6] J Douglas Carroll and JihJie Chang. Analysis of individual differences in multidimensional scaling via an nway generalization of “eckartyoung” decomposition. Psychometrika, 35(3):283–319, 1970.
 [7] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Computer Science, 2013.
 [8] Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and ChoJui Hsieh. Groupreduce: Blockwise lowrank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998, 2018.
 [9] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016.
 [10] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformerxl: Attentive language models beyond a fixedlength context. arXiv preprint arXiv:1901.02860, 2019.
 [11] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 933–941. JMLR. org, 2017.
 [12] Lieven De Lathauwer. Decompositions of a higherorder tensor in block terms—part ii: Definitions and uniqueness. SIAM Journal on Matrix Analysis and Applications, 30(3):1033–1066, 2008.
 [13] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. Published at ICLR2019, 2018.
 [14] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
 [15] Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. 2018.
 [16] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.
 [17] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [18] Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.
 [19] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
 [20] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 [21] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
 [22] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
 [23] Guangxi Li, Jinmian Ye, Haiqin Yang, Di Chen, Shuicheng Yan, and Zenglin Xu. Btnets: simplifying deep neural networks via block term decomposition. arXiv preprint arXiv:1712.05689, 2017.
 [24] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.
 [25] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 [26] Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černockỳ. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
 [27] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in neural information processing systems, pages 442–450, 2015.
 [28] Ivan V Oseledets. Tensortrain decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
 [29] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.
 [30] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pretraining. URL https://s3uswest2. amazonaws. com/openaiassets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018.
 [31] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Lowrank matrix factorization for deep neural network training with highdimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6655–6659. IEEE, 2013.
 [32] Rico Sennrich and Barry Haddow. Linguistic input features improve neural machine translation. arXiv preprint arXiv:1606.02892, 2016.
 [33] Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891, 2016.
 [34] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparselygated mixtureofexperts layer. arXiv preprint arXiv:1701.06538, 2017.
 [35] Ledyard R Tucker. Some mathematical notes on threemode factor analysis. Psychometrika, 31(3):279–311, 1966.
 [36] Ehsan Variani, Ananda Theertha Suresh, and Mitchel Weintraub. West: Word encoded sequence transducers. arXiv preprint arXiv:1811.08417, 2018.
 [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 [38] Yinchong Yang, Denis Krompass, and Volker Tresp. Tensortrain recurrent neural networks for video classification. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 3891–3900. JMLR. org, 2017.
 [39] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A highrank rnn language model. arXiv preprint arXiv:1711.03953, 2017.
 [40] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with blockterm tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9378–9387, 2018.
 [41] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 [42] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shunichi Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multiway data analysis and blind source separation. John Wiley & Sons, 2009.
Appendix A Tensor and Tensor Slice
As introduced in [42], a tensor and the tensor slice can be defined as follows.
Definition 1 (tensor).
Let ${D}_{\mathrm{1}}$, ${D}_{\mathrm{2}}$, $\mathrm{\dots}$, ${D}_{N}$$\mathrm{\in}\mathrm{N}$ denote index upper bounds. A tensor $\mathrm{A}\mathrm{\in}{\mathrm{R}}^{{D}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{D}_{N}}$ of order $N$ is an $N$way array where elements ${\mathrm{A}}_{{d}_{\mathrm{1}}\mathrm{,}{d}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{,}{d}_{n}}$ are indexed by ${d}_{n}\mathrm{\in}\mathrm{\{}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{,}\mathrm{\dots}\mathrm{,}{D}_{n}\mathrm{\}}$ for $\mathrm{1}\mathrm{\le}n\mathrm{\le}N$.
The concept of tensor slice is specified as:
Definition 2 (tensor slice).
A tensor slice is a twodimensional section (fragment) of a tensor, obtained by fixing all indexes except for two indexes.
Appendix B Theorem 3.1
Let ${\bm{e}}_{1},\mathrm{\dots},{\bm{e}}_{n}$ be basis vectors from the vector space $S$. Assume that these vectors ${\bm{e}}_{1},\mathrm{\dots},{\bm{e}}_{n}$ are linear independent and $Q$,$K$,$V$ can be linearly represented by this set of basis vectors. The output of selfattention function in Eq. 2 (in the paper) can be represented by a linear combination of the set of these basis vectors.
$$Attention(Q,K,V)=({\bm{e}}_{1},\mathrm{\dots},{\bm{e}}_{n})M,$$  (9) 
where $M\in {\mathbb{R}}^{n\times d}$ is a coefficient matrix, and $d$ is a dimension of these matrices (i.e., $Q,K$, and $V$).
Proof.
If $Q$, $K$ and $V$ $\in $ Span(${\bm{e}}_{1},\mathrm{\dots},{\bm{e}}_{n}$), the linear combination representation of matrices $Q$,$K$ and $V$ can be written as follows:
$$\{\begin{array}{cc}Q=({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})({\bm{\alpha}}_{1},{\bm{\alpha}}_{2},\mathrm{\dots},{\bm{\alpha}}_{d})\hfill & \\ K=({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})({\bm{\beta}}_{1},{\bm{\beta}}_{2},\mathrm{\dots},{\bm{\beta}}_{d})\hfill & \\ V=({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})({\bm{\xi}}_{1},{\bm{\xi}}_{2},\mathrm{\dots},{\bm{\xi}}_{d})\hfill & \end{array}$$  (10) 
The selfattention function is written as follows [37]:
$$Attention(Q,K,V)=softmax(\frac{Q{K}^{T}}{\sqrt{d}})V,$$  (11) 
where $Q{K}^{T}$ can be computed as follows:
$Q{K}^{T}=({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})({\bm{\alpha}}_{1},{\bm{\alpha}}_{2},\mathrm{\dots},{\bm{\alpha}}_{d}){({\bm{\beta}}_{1},{\bm{\beta}}_{2},\mathrm{\dots},{\bm{\beta}}_{d})}^{T}{({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})}^{T}$  (12) 
As a result, the input of $softmax$ function is a product of coefficient matrices $({\bm{\alpha}}_{1},\mathrm{\dots},{\bm{\alpha}}_{d})$ and ${({\bm{\beta}}_{1},\mathrm{\dots},{\bm{\beta}}_{d})}^{T}$. Then, we have
$$softmax(\frac{Q{K}^{T}}{\sqrt{d}})=({\bm{e}}_{1},\mathrm{\dots},{\bm{e}}_{n})softmax(A/\sqrt{d}){({\bm{e}}_{1},\mathrm{\dots},{\bm{e}}_{n})}^{T}$$  (13) 
where the matrix $A$ is equal to $({\bm{\alpha}}_{1},\mathrm{\dots},{\bm{\alpha}}_{d}){({\bm{\beta}}_{1},\mathrm{\dots},{\bm{\beta}}_{d})}^{T}$. Therefore, the attention representation can be written as follows:
$softmax({\displaystyle \frac{Q{K}^{T}}{\sqrt{d}}})V$  $=({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})softmax(A/\sqrt{d})({\bm{\xi}}_{1},{\bm{\xi}}_{2},\mathrm{\dots},{\bm{\xi}}_{d})$  (14)  
$=({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})M$ 
where the matrix $M$ is equal to $softmax(A/\sqrt{d})({\bm{\xi}}_{1},{\bm{\xi}}_{2},\mathrm{\dots},{\bm{\xi}}_{d})$. The $softmax(A/\sqrt{d})$ is to normalize the coefficient matrices of $Q$ and $K$. It turns out that the output of the attention function [37] can be represented by a linear combination of the set of basic vectors. ∎
After the proof, it is helpful to describe the basic idea. First, we consider that the selfattention function can be linearly represented by a set of orthogonal basis vectors, when the input of $softmax$ function is the product of two coefficient matrices, $({\bm{\alpha}}_{1},{\bm{\alpha}}_{2},\mathrm{\dots},{\bm{\alpha}}_{d})$ and ${({\bm{\beta}}_{1},{\bm{\beta}}_{2},\mathrm{\dots},{\bm{\beta}}_{d})}^{T}$, respectively. Second, in constructing the multihead mechanism, the matrices of basis vectors $({\bm{e}}_{1},{\bm{e}}_{2},\mathrm{\dots},{\bm{e}}_{n})$ can be shared.
Appendix C Corollary 1
Under the same conditions as in Theorem 3.1 and the value of $N$ is equal to the value of $d$, the Singleblock attention representation Eq. $5$ (in the paper) can reconstruct the Scaled DotProduct attention in Eq. $2$ (in the paper) by the summing over the tensor (i.e., the output of Singleblock attention function) according to the second index. It holds that:
$$Attention{(Q,K,V)}_{i,m}=\sum _{j=1}^{N}Atte{n}_{TD}{(\mathcal{G};Q,K,V)}_{i,j,m},$$  (15) 
where $i$, $j$ and $m$ are the indices of the Singleblock attention output (i.e., a 3order tensor). $Atte{n}_{TD}(\cdot )$ is the function of the Singleblock attention based on Tucker decomposition. $i$ and $m$ are the indices of outputs (i.e., a matrix) from Eq. $2$ (in the paper).
Proof.
In Theorem 3.1, we have proved the results about the attention function can be represented by a linear combination of basis vectors. Therefore, we can represent the selfattention function in Eq. 2 (in the paper) by the form as follows:
$$Attention(Q,K,V)=\mathrm{\Theta}Q{K}^{T}V$$  (16) 
where $\mathrm{\Theta}$ is a normalization factor matrix, which can be used to replace the use of a $sofmax$ function. We assume that $\mathrm{\Theta}$ contains all the nonzero elements of the core tensor $\mathcal{G}$. The selfattention in Eq. 2 (in the paper) can be rewritten as follows:
$${X}_{i,m}=\sum _{k=1}^{N}\sum _{r=1}^{R}{\mathrm{\Theta}}_{i,m}{Q}_{i,r}{K}_{k,r}{V}_{k,m}$$  (17) 
where $N$ is the length of a sentence, ${X}_{i,m}=Attention{(Q,K,V)}_{i,m}$ is the entry of the output from the selfattention, and $R$ is equal to $d$. Here the core tensor $\mathcal{G}$ is same as that in Eq. $7$ (in the paper). Then, the Singleblock attention (a $3$order tensor) can be represented as follows:
$${\mathcal{A}}_{i,j,m}=\sum _{p}^{R}\sum _{q}^{R}\sum _{r}^{R}{\mathcal{G}}_{p,q,r}{Q}_{i,p}{K}_{j,p}{V}_{m,r}$$  (18) 
where $\mathcal{A}$ is a $3$order tensor, which is equal to $Atte{n}_{TD}(\mathcal{G};Q,K,V)$. Accordingly, ${\mathcal{A}}_{i,j,m}$ is a entry in tensor $\mathcal{A}$ and is equal to $Attention_{TD}{}_{i,j,m}$ in Eq. 15. Next, we aim to prove Eq. 7 can be established. Therefore, we need to establish the relation between Eq. 18 and Eq. 17. Since the core tensor $\mathcal{G}$ is a special tensor (i.e., diagonal tensor), Eq. 18 can be written as follows:
$${\mathcal{A}}_{i,j,m}=\sum _{r=1}^{R}{\mathcal{G}}_{r,r,r}{Q}_{i,r}{K}_{j,r}{V}_{m,r}$$  (19) 
After that, we can compute the attention representation through adding to model $k$. For better understanding, we give the graph representation in Figure 3.
$${X}_{i,m}=\sum _{r=1}^{R}\sum _{j=1}^{N}{\mathcal{G}}_{rrr}{Q}_{i,r}{K}_{j,r}{V}_{m,r}$$ 
The corollary then holds. ∎
Appendix D Compression Ratio about MultiLinear Attention
In order to compute the compression ratio, we need to compare multilinear attention with multihead attention. The comparison chart has been given in Figure 4.
In Figure 4, each $Linear$ function in multihead attention is about a weight matrix $W\in {\mathbb{R}}^{{d}_{model}\times d}$, and all weight matrices in multihead attention are different. In multilinear attention, three weight matrices are used and $h$ (a number) weight vectors are used. Through the analysis about Figure 4, the compression ratio is computed as follows.
$$\begin{array}{cc}\hfill compressionratio=& \frac{3\times h\times {d}_{model}\times d}{3\times {d}_{model}\times d+h\times d}\hfill \\ \hfill =& \frac{3\times h\times {d}_{model}}{3\times {d}_{model}+h}\hfill \end{array}$$  (20) 
In practice, $h$ is equal to $8$ and $d$ is equal to $512$. The compression ratio approximates $8$ in this case. In our work, the dimension of vector ${\mathcal{G}}_{r}$ is set as $R$ which is smaller than $d$, where $d$ is the dimension of attention matrix.
Lowrank Approximation for Model Compression In the paper, we have described that our method combines two compression ideas, namely lowrank approximation and parameters sharing. Parameters sharing can be understood through the description of Figure 4. In Multilinear attention, the idea of lowrank decomposition also has the function of model compression. We have proved that the Singleblock attention can reconstruct an onehead selfattention in Transformer. In order to obtain the representation of a tensorized attention, we adopt the tensor splitting and the concat function. After that, we consider that each tensor slice from tensor splitting approximates the output of the selfattention function Eq. 2 (in the paper). When we only focus on the idea of lowrank approximation, the compression ratio can be computed by the form, $\frac{N\times d}{N\times N}$, where $N$ is the length of a sequence, $d$ is the dimension of a matrix (also namely hidden size). $N$ is smaller than $d$, normally.
Through combining the ideas of parameters sharing and lowrank approximation, by formally considering the rank $R$, the compression ratio of Multilinear attention model can be computed as follows:
$$compressionrati{o}_{R}=\frac{3\times h\times {d}_{model}\times d}{3\times {d}_{model}\times d+R\times h},$$  (21) 
where $R$ is the rank of the core tensor $\mathcal{G}$. The compression ratio will be larger when $R$ is smaller. This $compressionrati{o}_{R}$ is the compression ratio associated with $R$. $R$ need to be set in practice. In experiments, $R$ can be set to $18$, which is smaller than ${d}_{model}$.
Appendix E Experiment
E.1 Partial Structure about Tensorized Transformer
in the paper, the multilinear attention is proposed. In order to show that the process of incorporating multilinear attention into Transformer, Figure 5 gives out some information about the structure.
E.2 Experimental Details in Language Modeling
Now, we report some details of experiments as a relevant supplementary material. Firstly, we use three weight matrices ${W}^{q}$,${W}^{k}$ and ${W}^{v}$ to linearly project the queries, keys and values. The outputs from the linear projections can be shared by $h$ times, where $h$ is the number of core tensors in our background (i.e., core1($h$=$1$), core2($h$=$2$)). We use Block Term Tensor decomposition (BTD) to construct a new representation, namely Multilinear attention, which is a $3$order tensor. For incorporating the proposed attention into the architecture of Transformer, we split the $3$order tensor, and then concat each matrix from the tensor. For other layers, we use the same structure as vanillaTransformer.
Hardware
We trained our model on one machine with 2 NVIDIA P40 GPUs. For our base models, the hyperparameters are described in Table 4. In addition, we set the $dropout$=$0.3$ in all datasets. The model is trained using $30$ epochs in three datasets (PTB, WikiText103 and OneBillion).
Datasets  ${d}_{head}$  ${d}_{ff}$  h  L  ${d}_{k}$  ${d}_{v}$  Test PPL 

PTB  256  2100  2  3  40  40  49.8 
WikiText103  256  2100  2  6  40  40  18.9 
OneBillion  1024  2100  2  6  40  40  19.5 
Optimizer We used the Adam optimizer and vary the learning rate over the course of training. The vary formula [37] is followed in our work. We also used the $warmup\mathrm{\_}steps=4000$. Label Smoothing is employed with the value $\u03f5$=$0.1$.
E.3 Experiment Details in Neural Machine Translation
The Tensorized Transformer also has been applied to Neural Machine Translation task. In this experiment, we use the same setup with Transformer [37], and replace the multihead attention with the proposed multilinear attention in the encoder structure. In the decoder structure, we still use the multihead attention for verifying the effectiveness of encoding a sentence. The model is trained in 1 NVIDA P40 GPUs.
E.4 Experimental comparison
For a more detailed comparison, we design these experiments as follows. In this section, we mainly show the experimental results on two language modeling datasets, i.e., PTB and WikiText103. We show the value of perplexity, as well as FLOPs^{4}^{4} 4 FLOPs:The number of floatingpoint operations correspondingly.
Model  PTB  WikiText103  

Params  FLOPS  Test PPL  Params  FLOPS  Test PPL  
TransformerXL [10]  24M  11.5B  59.1  257M  996.5B  18.3 
Tensorized Transformer  24M  5.4B  52.7  257M  312.0B  21.2 
TransformerXL [10]  –  –  –  151M  126.5B  24.0 
Tensorized Transformer  –  –  –  151M  83.4B  18.8 
TransformerXL [10]  12M  4.5B  87.8  85.5M  22.0B  34.8 
Tensorized Transformer  12M  0.75B  57.9  85.5M  17.3B  20.9 
Datasets  Model  ${d}_{head}$  ${d}_{model}$  $N$  $L$  dropout  Test PPL 
PTB  TransformerXL  40  256  30  3  0.3  81.2 
PTB  Tensorized Transformer  40  256  30  3  0.3  50.2 
WikiText103  TransformerXL  40  256  80  6  0.1  34.86 
WikiText103  Tensorized Transformer  40  256  80  6  0.1  19.9 
OneBillion  TransformerXL  40  1024  100  6  0.1  43.6 
OneBillion  Tensorized Transformer  40  1024  100  6  0.1  26.7 
To further compare the experimental results under the same size of parameters between Tensorized Transformer and the baseline model (i.e., TransformerXL), we add some experiments in Table 5. In our paper, we use the Tensorized Transformer of 12M on PTB dataset and the Tensorized Transformer of 85.5M on WikiText103 dataset. To achieve the same size, i.e., 12M for TransformerXL on PTB, we can reduce the dimensions of $Q,K,V$ from $40$ to $26$. To achieve TransformerXL of 85.5M, we reduce the dimensions of word embedding from $512$ to $256$. The experimental results are shown in Table 5. Our model gets better results and the lower FLOPS than TransformerXL.
On the other hand, we can also increase the parameters of Tensorized Transformer to reach the parameters reported by TransformerXL on PTB and WikiText103 datasets. On PTB dataset, we can increase the number of layers from $3$ to $7$ to get the Tensorized Transformer of 24M. On WikiText103 dataset, we increase the number of layers from $6$ to $11$ and the length of sequence from $80$ to $120$ to get the Tensorized Transformer of 257M. We can increase the number of layers from $6$ to $8$ and the length of sequence from $80$ to $100$ to get the Tensorized Transformer of 151M. After that, Tensorized Transformer achieves better results and lower FLOPS than TransformerXL. These results are shown in Table 5.
In addition, we also carry out experiments when TransformerXL has the same hyperparameters with Tensorized Transformer. Experimental results are shown in Table 6. Table 6 shows that our model can get the better results than TransformerXL. Besides, on two datasets (i.e., PTB and WiliText103), we also try to train our model (Tensorized Transformer) using larger dimension of word embedding (i.e., ${d}_{model}$). If ${d}_{model}$ is larger than $256$ on PTB dataset and larger than $512$ on WikiText103 dataset, the overfitting will occur. For the overfitting problem, we will investigate it in our future work.
Appendix F Partial Code
The project have been achieved by pytorch. In this section, we give the partial code which is about our methods, i.e., Singblock attention and Multilinear attention. First, the class of Singleblock attention is given as follows.
Each Single block attention is a component of Multilinear attention. Based on the Single block attention, the Multilinear attention can be given as follows.