A Tensorized Transformer for Language Modeling

  • 2019-08-09 09:27:41
  • Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, Ming Zhou

Abstract

Latest development of neural models has connected the encoder and decoder through a self-attention mechanism. In particular, Transformer, which is solely based on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multi-head attention mechanism, as a key component of Transformer, limits the effective deployment of the model to a limited resource setting. In this paper, based on the ideas of tensor decomposition and parameters sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-billion) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention can not only largely compress the model parameters but also obtain performance improvements, compared with a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor train decomposition.

 

Xindian Ma1, Peng Zhang1  , Shuai Zhang1, Nan Duan2, Yuexian Hou1, Dawei Song3, Ming Zhou2
1College of Intelligence and Computing, Tianjin University, Tianjin, China
2Microsoft Research Asia, Beijing, China
3School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
{xindianma, pzhang, szhang96, yxhou}@tju.edu.cn
{nanduan, mingzhou}@microsoft.com
{dwsong}@bit.edu.cn
Corresponding Author: Peng Zhang

Preprint. Under review.

1 Introduction

In NLP, neural language model pre-training has been shown to be effective for improving many tasks devlin2018bert ; peters2018deep . Transformer vaswani2017attention is based solely on the attention mechanism, dispensing with recurrence and convolutions entirely. At present, this model has received extensive attention and plays a key role in many neural language models, such as BERT devlin2018bert , GPT radford2018improving and Universal Transformer dehghani2018universal . However, the large number of parameters in Transformer-based models may cause problems when training and deploying them in a limited resource setting. Thus, the compression of large pre-trained neural language models has become an essential problem in NLP research.

In the literature, several compression methods have been proposed khrulkov2019tensorized ; ye2018learning ; han2015learning . When the vocabulary is large, the corresponding weight matrices can be enormous. Tensorized embedding (TE) khrulkov2019tensorized uses the tensor-train decomposition oseledets2011tensor to compress the embedding layers in Transformer-XL dai2019transformer . However, TE khrulkov2019tensorized only studies the compression of input embedding layers, rather than the attention layer. Recently, Block-Term Tensor Decomposition (BTD) de2008decompositions has been used to compress recurrent neural networks (RNNs) ye2018learning . Ye et al. ye2018learning propose a compact, flexible structure to deal with the large number of model parameters caused by high-dimensional inputs when training RNNs. This method greatly reduces the parameters of RNNs and improves their training efficiency. Still, the model only considers compressing the input layer through low-rank approximation. On the other hand, some methods han2015learning ; buci2006model aim to develop a specific structure on the weight matrices and are successful in compressing pre-trained models. However, the new structure obtained after compression cannot be integrated into the model.

In Transformer, multi-head attention is a key component and contains a large number of parameters. Specifically, Vaswani et al. vaswani2017attention compute the attention function on a set of queries simultaneously, packed together into a matrix Q, while the keys and values are also packed together into matrices K and V, respectively. The attention function then applies a non-linear softmax function over the three matrices Q, K and V. There are two challenges in finding a high-quality method to compress the multi-head attention in Transformer.

First, the self-attention function in Transformer is a non-linear function, which makes it difficult to compress. In order to address this challenge, we first prove that the output of the attention function of the self-attention model vaswani2017attention can be linearly represented by a group of orthonormal basis vectors. Q, K and V can be considered as factor matrices. Then, by initializing a low-rank core tensor, we use Tucker decomposition tucker1966some ; li2017bt to reconstruct a new attention representation. In order to construct the multi-head mechanism and compress the model, we use Block-Term Tensor Decomposition (BTD), which is a combination of CP decomposition carroll1970analysis and Tucker decomposition tucker1966some . The difference from standard BTD is that the three factor matrices Q, K and V are shared in constructing each 3-order block tensor. This sharing removes a large number of parameters.

The second challenge is that the compressed attention model cannot be directly integrated into the encoder and decoder framework of Transformer vaswani2017attention ; dai2019transformer . We address this challenge in three steps. First, the average of the block tensors is computed. Second, the averaged tensor is split into matrices. Third, the concatenation of these matrices serves as the input to the next layer in Transformer. After that, the model can be integrated into the encoder and decoder framework of Transformer vaswani2017attention ; dai2019transformer and trained end-to-end. Moreover, we also prove that the 3-order tensor can reconstruct the scaled dot-product attention in Transformer by a sum over a particular dimension.

Our method combines two ideas, low-rank approximation and parameters sharing, at the same time, and therefore achieves higher compression ratios. Although the self-attention (i.e., scaled dot-product attention) in Transformer can be reconstructed, we do not reconstruct it. Instead, we split the 3-order tensor (the output of Multi-linear attention), which helps to improve accuracy in our experiments.

The major contributions of this paper are as follows:

  • 1) It is proved that the output of scaled dot-product attention (considered as a function) can be linearly represented by a group of orthonormal basis vectors.

  • 2) A novel self-attention method, namely Multi-linear attention, is provided, which combines two compression ideas, parameters sharing and low-rank approximation.

  • 3) Multi-linear attention builds a strong connection between three factor matrices (which pack a set of queries, keys and values, respectively), enhancing the ability to capture sufficient attention information. We also prove that our model can reconstruct the scaled dot-product attention in the original Transformer.

In order to validate the benefits of our model, we test it on two NLP tasks, namely language modeling and neural machine translation. In our experiments, the multi-head attention is replaced by the proposed model, namely Multi-linear attention. We observe that the standard multi-head attention can be compressed with higher compression ratios on the One-Billion dataset. As a result, we show that Multi-linear attention not only considerably reduces the number of parameters, but also achieves promising experimental results, especially in language modeling tasks.

2 Preliminaries

This paper proposes Multi-linear attention. Its analysis relies on concepts and results from the fields of tensor decomposition and multi-head attention. Section 2.1 covers basic background on Block-Term tensor decomposition de2008decompositions . Section 2.2 then describes multi-head attention vaswani2017attention .

2.1 Tensor and Block-Term Tensor Decomposition

Tensor We use the Euler script letter $\mathcal{A}$ to denote a tensor, which can be thought of as a multi-dimensional array. Thereby a vector and a matrix are a 1-order tensor and a 2-order tensor, respectively. An element of an n-order tensor is denoted as $\mathcal{A}_{d_1,\ldots,d_n}$. In the geometric representation of a tensor, a 3-order tensor can be represented by a cube. There is also a related concept, the tensor slice, that will be used in this paper. Tensors and some other related concepts are shown in Supplementary Materials A.

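Block-Term Tensor Decomposition (BTD) Block-Term tensor decomposition is a combination of CP decomposition carroll1970analysis and Tucker decomposition tucker1966some . Given an n-order tensor $\mathcal{A} \in \mathbb{R}^{d_1 \times \cdots \times d_n}$, BTD decomposes this high-order tensor into P block terms. $\bullet_z$ denotes the tensor-tensor product on the z-th order kolda2009tensor , with $z \in \{1,\ldots,d\}$. Each term contains $\bullet_z$ between a core tensor $\mathcal{G}_i \in \mathbb{R}^{R_1 \times \cdots \times R_d}$ and d factor matrices $\mathcal{X}_i^{(k)} \in \mathbb{R}^{d_k \times R_k}$, where $i \in [1,P]$ and $k \in [1,d]$. The formulation of BTD is as follows:

$$\mathcal{A} = \sum_{i=1}^{P} \mathcal{G}_i \bullet_1 \mathcal{X}_i^{(1)} \bullet_2 \mathcal{X}_i^{(2)} \bullet_3 \cdots \bullet_d \mathcal{X}_i^{(d)} \tag{1}$$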

where P is the CP rank, and d is the Core-order. In our work, we consider 3-order tensors. Figure 1 demonstrates how a 3-order tensor $\mathcal{A}$ can be decomposed into P block terms.

Figure 1: The representation of Block-Term tensor decomposition for a 3-order tensor. $\mathcal{A} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ is a 3-order tensor, and can be approximated by the sum of P Tucker decompositions. P is the CP rank, and $R_1, R_2, R_3$ are the Tucker ranks. In this paper, we assume that $R = R_1 = R_2 = R_3$.
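As an illustration of Eq. 1 in the 3-order case, the following is a minimal NumPy sketch that sums P Tucker blocks into one tensor. The function names (`tucker_block`, `btd_reconstruct`) and all sizes are hypothetical choices made for this example and do not come from the paper.

```python
import numpy as np

def tucker_block(core, factors):
    """One block term: a core of shape (R1, R2, R3) multiplied along each
    mode by a factor matrix X(k) of shape (d_k, R_k)."""
    x1, x2, x3 = factors
    # The three mode products are written as a single contraction.
    return np.einsum("abc,ia,jb,kc->ijk", core, x1, x2, x3)

def btd_reconstruct(cores, factor_sets):
    """Sum of P block terms, following Eq. 1 with d = 3."""
    return sum(tucker_block(g, f) for g, f in zip(cores, factor_sets))

rng = np.random.default_rng(0)
dims, rank, num_blocks = (4, 5, 6), 2, 3  # illustrative sizes only
cores = [rng.standard_normal((rank, rank, rank)) for _ in range(num_blocks)]
factor_sets = [
    [rng.standard_normal((d, rank)) for d in dims] for _ in range(num_blocks)
]
approx = btd_reconstruct(cores, factor_sets)
print(approx.shape)  # (4, 5, 6)
```

In this sketch every block has its own factor matrices, as in standard BTD.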

2.2 Multi-head Attention

In Transformer, the attention function is named “Scaled Dot-Product Attention”. In practice, Transformer vaswani2017attention processes queries, keys and values as matrices Q, K, and V respectively. The attention function can be written as follows:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{2}$$

where d is the number of columns of Q and K. The works vaswani2017attention ; devlin2018bert ; dai2019transformer all use multi-head attention, as introduced in vaswani2017attention ,

$$\mathrm{MultiHeadAttention}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_k)W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{3}$$

where the matrices $W_i^{Q}$ and $W_i^{K} \in \mathbb{R}^{d_{model} \times d}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d}$ and $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$. In practice, $d_v$ is equal to d. In vaswani2017attention , multiple groups of parameters ($W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$) are used, which results in a large number of redundant parameters.
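For reference, Eq. 2 and Eq. 3 can be sketched in a few lines of NumPy. This is a minimal, unbatched illustration without masking or dropout; the function names and the sizes used below are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (Eq. 2); d is the number of columns of q and k."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o):
    """Eq. 3: w_q, w_k, w_v are lists holding one (d_model, d) projection per head."""
    heads = [attention(q @ wq, k @ wk, v @ wv) for wq, wk, wv in zip(w_q, w_k, w_v)]
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4  # illustrative sizes only
d = d_model // h
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v = ([rng.standard_normal((d_model, d)) for _ in range(h)] for _ in range(3))
w_o = rng.standard_normal((h * d, d_model))
out = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o)
print(out.shape)  # (6, 16)
```

Each head carries its own three projection matrices, which is the per-head parameter cost that the sharing scheme in Section 3 targets.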

3 Tensorized Transformer

In this section, we first build a Single-block attention, shown in Figure 2 (left), based on the Tucker decomposition, a low-rank decomposition method. In this process, we prove that the self-attention function in Transformer can be represented by a linear function, i.e., a linear combination of a set of basis vectors.

In order to compress the multi-head mechanism, we propose a Multi-linear attention constructed by a Block-Term tensor decomposition. This attention uses the idea of parameters sharing, i.e., sharing factor matrices across multiple blocks, as shown in Figure 2 (right). We then analyze the compression ratio and the reduced complexity.

Figure 2: (left) Single-block attention using Tucker decomposition. (right) Multi-linear attention based on Block-Term tensor decomposition.

3.1 Single-block Attention by Tucker Decomposition

Before building the Single-block attention, we state Theorem 3.1, which is closely related to the properties of the Single-block attention function based on Tucker decomposition tucker1966some .

Theorem 3.1.

Let $\mathbf{e}_1,\ldots,\mathbf{e}_n$ be basis vectors from the vector space S. Assume that these vectors $\mathbf{e}_1,\ldots,\mathbf{e}_n$ are linearly independent. The output of the attention function in Eq. 2 can be represented by a linear combination of these basis vectors:

$$\mathrm{Attention}(Q,K,V) = (\mathbf{e}_1,\ldots,\mathbf{e}_n)M, \tag{4}$$

where $M \in \mathbb{R}^{n \times d}$ is a coefficient matrix, and d is a dimension of these matrices (i.e., Q, K, and V).

Proof.

The proof can be found in Supplementary Materials B. ∎

Figure 2 (left) is a schematic diagram of the Single-block attention. First, we assume that the query, key and value can be mapped into three factor matrices, each composed of a group of orthogonal basis vectors. The three factor matrices are Q, K and V. After that, we construct a new attention (i.e., Single-block attention) by initializing a trainable 3-order diagonal tensor $\mathcal{G}$. In Figure 2 (left), R is the rank of the tensor, N is the length of a sequence, and d is the dimension of the matrix. The Single-block attention function can be computed based on Tucker decomposition as follows:

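$$\mathrm{Atten}_{TD}(\mathcal{G};Q,K,V) = \mathcal{G} \bullet_1 Q \bullet_2 K \bullet_3 V = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{m=1}^{M} \mathcal{G}_{ijm}\, Q_i \circ K_j \circ V_m \tag{5}$$

where $\mathcal{G}$ is a core tensor, and i, j and m are the indices of the core tensor. $\circ$ is the outer product, and $\bullet_z$ has the same definition as in Eq. 1. $Q_i$, $K_j$ and $V_m$ are column vectors from the matrices Q, K and V, where $Q \in \mathbb{R}^{N \times d}$, $K \in \mathbb{R}^{N \times d}$ and $V \in \mathbb{R}^{N \times d}$, and N is the length of a sequence. In practice, we set I=J=M=R. The core tensor $\mathcal{G}$ is defined as follows,

$$\mathcal{G}_{ijm} = \begin{cases} \mathrm{rand}(0,1) & i=j=m \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where rand(0,1) is a random function, and the diagonal entries of the core tensor $\mathcal{G}$ form the vector $\mathbf{g}$. Each entry $\mathbf{g}_r \in (0,1)$, $r \in \{1,\ldots,R\}$. We can consider $\mathbf{g}$ as the trainable weight. In experiments, we compute the weight vector with the softmax function (i.e., softmax($\mathbf{g}$)).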
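Because the core tensor in Eq. 6 is diagonal, the triple sum in Eq. 5 collapses to a single sum over r. The sketch below illustrates this in NumPy under one assumption that the text leaves open: the factor matrices are taken to have exactly R columns, i.e. shape (N, R). The function name `single_block_attention` is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def single_block_attention(g, q, k, v):
    """Eq. 5 with the diagonal core of Eq. 6:
    T[i, j, m] = sum_r softmax(g)[r] * q[i, r] * k[j, r] * v[m, r].

    Assumption: q, k, v have shape (N, R); g has length R.
    Returns a 3-order tensor of shape (N, N, N).
    """
    weights = softmax(g)
    return np.einsum("r,ir,jr,mr->ijm", weights, q, k, v)

rng = np.random.default_rng(0)
n, r = 5, 3  # illustrative sizes only
g = rng.uniform(0.0, 1.0, size=r)  # diagonal entries of the core tensor
q, k, v = (rng.standard_normal((n, r)) for _ in range(3))
t = single_block_attention(g, q, k, v)
print(t.shape)  # (5, 5, 5)
```

The only trainable parameters of the block itself are the R entries of `g`, which is why a diagonal core keeps the added parameter count small.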

After that, the output of the Single-block attention function is a 3-order tensor obtained by linear computation. The Single-block attention (i.e., a 3-order tensor with Tucker decomposition) can reconstruct the Scaled Dot-Product attention in Eq. 2 by summing over the tensor along the second index (if the coordinates of a 3-order tensor are i, j and m, then j is the second index; it can be seen as the coordinate in the vertical direction of the tensor), as proved in the following corollary. Note that in our model, we do not adopt this reconstruction process. Instead, to obtain a new representation, we adopt the concat method after the tensor splitting (see Sec. 3.2). We will further show the compression ability of the Single-block attention in Sec. 3.3.

Corollary 1.

Under the same conditions as in Theorem 3.1, and assuming that the elements in each row of the matrix V are the same, the Single-block attention representation in Eq. 5 can reconstruct the Scaled Dot-Product attention in Eq. 2 by summing over the tensor (i.e., the output of the Single-block attention function) along the second index. It holds that:

$$\mathrm{Attention}(Q,K,V)_{i,m} = \sum_{j=1}^{d} \mathrm{Atten}_{TD}(\mathcal{G};Q,K,V)_{i,j,m} \tag{7}$$

where i, j and m are the indices of the Single-block attention’s output (i.e., a 3-order tensor), and d is the dimension of the second index. $\mathrm{Atten}_{TD}(\cdot)$ is the Single-block attention function based on Tucker decomposition. i and m are the indices of the output (i.e., a matrix) of Eq. 2.

Proof.

The proof can be found in Supplementary Materials C. ∎

3.2 Multi-Linear Attention by Block-Term Tensor Decomposition

In order to construct the multi-head mechanism and compress the multiple groups of mapping parameters, we use a single group of linear projections and share the outputs of these projections. In Figure 2 (right), the learned linear projections map queries, keys and values to three matrices which are composed of basis vectors. After that, we use the Block-Term tensor decomposition to build the multi-head mechanism. We name our model Multi-linear attention, which can be formulated as follows:

$$\mathrm{MultiLinear}(\mathcal{G};Q,K,V) = \mathrm{SplitConcat}\left(\frac{1}{h}\left(T_1 + \cdots + T_h\right)\right)W^{O}, \quad \text{where } T_j = \mathrm{Atten}_{TD}(\mathcal{G}_j; QW^{q}, KW^{k}, VW^{v}) \tag{8}$$

where each core tensor $\mathcal{G}_j$, $j \in \{1,\ldots,h\}$, is a diagonal tensor whose number of parameters is equal to the rank of the core tensor. $\mathcal{G}$ is the set of core tensors. $\mathrm{SplitConcat}(\cdot)$ is a function that splits a 3-order tensor and concatenates the resulting pieces. Figure 2 (right) shows the basic idea of Multi-linear attention. $W^{O}$ is the parameter matrix of a fully connected layer applied to the output of Multi-linear attention. $\mathrm{Atten}_{TD}(\cdot)$ is the Single-block attention function, which is one part of Multi-linear attention. $W^{q}$, $W^{k}$ and $W^{v}$ are the parameter matrices that are shared across blocks in constructing Multi-linear attention.
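The following NumPy sketch shows one possible reading of Eq. 8. Several details are assumptions made for this illustration and are not specified in the text: the shared projections map to R columns, SplitConcat splits the averaged tensor along its last mode and concatenates the slices column-wise, and the sequence length N is fixed so that $W^{O}$ can have shape (N·N, d_model). All function names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def atten_td(g, q, k, v):
    """Single-block attention with a diagonal core (weights softmax(g))."""
    return np.einsum("r,ir,jr,mr->ijm", softmax(g), q, k, v)

def split_concat(t):
    """Assumed SplitConcat: split along the last mode, then concatenate
    the (N, N) slices column-wise into an (N, N * N) matrix."""
    slices = np.split(t, t.shape[2], axis=2)
    return np.concatenate([s[:, :, 0] for s in slices], axis=1)

def multi_linear_attention(cores, x_q, x_k, x_v, w_q, w_k, w_v, w_o):
    # One group of projections; the resulting factor matrices are shared by all blocks.
    q, k, v = x_q @ w_q, x_k @ w_k, x_v @ w_v
    t_mean = sum(atten_td(g, q, k, v) for g in cores) / len(cores)
    return split_concat(t_mean) @ w_o

rng = np.random.default_rng(0)
n, d_model, r, h = 4, 8, 3, 2  # illustrative sizes only
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, r)) for _ in range(3))
cores = [rng.uniform(0.0, 1.0, size=r) for _ in range(h)]  # one diagonal core per block
w_o = rng.standard_normal((n * n, d_model))
out = multi_linear_attention(cores, x, x, x, w_q, w_k, w_v, w_o)
print(out.shape)  # (4, 8)
```

Under these assumptions, adding a block costs only R extra parameters (its diagonal core), since the projections are computed once and reused.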

The Multi-linear attention is a compression model. Compressing the multi-head attention in Transformer with it yields a Tensorized Transformer. The Multi-linear attention can be incorporated into the Transformer architecture. A diagram showing how Multi-linear attention is incorporated into part of the Transformer structure is given in Supplementary Materials E.1.

3.3 Analysis of Compression and Complexity

Compression Our focus is on compressing the multi-head mechanism in the multi-head attention of Transformer. Previous work vaswani2017attention obtains the multi-head attention through multiple groups of linear mappings. We use three linear mappings for the matrices Q, K and V, respectively. The outputs of the three mappings are shared and treated as the three factor matrices in constructing the Multi-linear attention. This process is shown in Figure 2 (right). h is the number of heads in vaswani2017attention , and d is the dimension of the factor matrices. The compression ratio can be computed as (3×h×d)/(3×d+h). In practice, h is normally set to 8 and d is set to 512. In this case, the compression ratio is approximately 8. In other words, the attention layer has almost 8 times fewer parameters. The details of computing the compression ratio can be found in Supplementary Materials D. The Transformer also contains other network layers, such as the position-wise feed-forward network and embedding layers. Therefore, for the compression ratio of the whole Transformer, we compare model parameter counts in the experimental results.
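As a quick arithmetic check of the ratio above, the formula can be evaluated directly for the quoted setting. The function name is a hypothetical helper for this example; only the formula and the values h = 8, d = 512 come from the text.

```python
def attention_compression_ratio(h: int, d: int) -> float:
    """Compression ratio (3*h*d) / (3*d + h) from Section 3.3."""
    return (3 * h * d) / (3 * d + h)

ratio = attention_compression_ratio(h=8, d=512)
print(round(ratio, 2))  # 7.96
```

With h = 8 and d = 512 the numerator is 12288 and the denominator is 1544, so the ratio is slightly below 8 because of the extra h term in the denominator.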

Complexity Eq. 5 reduces the time complexity in the attention layer. The time complexity of the attention function in Eq. 2 is $\mathcal{O}(N^2 d)$, where N is the length of a sequence and d is the representation dimension. However, we can reorder the computations to reduce the complexity to $\mathcal{O}(R^2 d)$, where R is the rank of the tensor, which can be set in our experiments. In our experiments, R is set to a value between 10 and 18, which is smaller than N. The minimum number of sequential operations in Multi-linear attention for different layers is lower than that of the self-attention in Transformer vaswani2017attention .

4 Related Work

The field of language modeling has witnessed many significant advances. Unlike language models based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the Transformer vaswani2017attention and its variants dai2019transformer ; devlin2018bert ; dehghani2018universal achieve excellent results in language modeling. Transformer networks have the potential to learn long-term dependencies, but are limited by a fixed-length context in the setting of language modeling. Transformer-XL dai2019transformer uses a segment-level recurrence mechanism and a novel positional encoding scheme to resolve this problem. BERT devlin2018bert learns bidirectional encoder representations from Transformers. It is designed to pre-train deep bidirectional representations and obtains new SoTA results on some NLP tasks. Although these methods have achieved great results, their large number of parameters makes the models difficult to train with limited resources. Transformers fail to generalize on many simple tasks, e.g. copying strings and logical inference dehghani2018universal . Universal Transformers dehghani2018universal propose a self-attentive recurrent sequence model which addresses this problem. This method can increase the training speed. Following the weight sharing found in CNNs and RNNs, the authors extend the Transformer with a simple form of weight sharing that strikes an effective balance between inductive bias and model expressivity. This method also uses a large number of parameters.

Therefore, it is very important to consider how to reduce the amount of memory and computation these models need. Existing model compression methods are mainly divided into parameter pruning and sharing han2015learning , low-rank approximation sainath2013low , knowledge transfer buci2006model , and transferred convolutional filters cohen2016group . Here, we mainly review the most relevant compression methods. Tensor decomposition methods, which in most cases adopt the idea of low-rank approximation, have been successfully applied to neural network compression. For example, in denton2014exploiting ; jaderberg2014speeding , researchers approximate a tensor by minimizing the reconstruction error of the original parameters of convolutional neural networks (CNNs). However, these approaches tend to accumulate errors when multiple layers are compressed sequentially, and the output feature maps deviate far from the original values as the number of compressed layers increases. Our compression method uses the idea of parameters sharing in constructing the attention layers, and the size of its output is the same as that of the self-attention output in Transformer, which effectively avoids these problems. Tensorizing Neural Networks novikov2015tensorizing combine the idea of reshaping the weights of fully-connected layers into high-dimensional tensors with representing them in Tensor Train format oseledets2011tensor . This approach was later extended to convolutional garipov2016ultimate and recurrent neural networks yang2017tensor . Recently, in chen2018groupreduce ; variani2018west , researchers introduce efficient compression methods for the embedding and softmax layers based on structured low-rank matrix approximation. TT-embedding khrulkov2019tensorized aims to compress the large embedding layer of Transformer-XL dai2019transformer . Our method is different from these works, and combines two compression ideas (low-rank approximation and parameters sharing) to construct a tensorized Transformer.

In our work, we focus on compressing the multi-head attention in Transformer based on the idea of parameters sharing. At the same time, we also use low-rank approximation to reduce parameters and time complexity.

5 Experiments

Transformer is a versatile and powerful modeling tool and is widely used in various natural language processing tasks. In order to verify the effectiveness of our method (i.e., Multi-linear attention) as a replacement for multi-head attention in Transformer, we carry out two NLP tasks, language modeling (LM) and neural machine translation (NMT). Complete code for running the experiments will be released after the paper is accepted, while the key code for our method can be found in Supplementary Materials F.

5.1 Language Modeling

Language modeling is the task of predicting the next word in a sentence. This task estimates the joint probability $p(s)$ of a sentence of tokens $s=(w_1,\ldots,w_n)$. The resulting models can be used to generate text or be further fine-tuned to solve other NLP tasks radford2018improving . In this paper, we employ the standard setting of predicting the next token given the sequence of preceding tokens, based on the factorization $p(s)=p(w_1)\prod_{i=2}^{n}p(w_i \mid w_1,\ldots,w_{i-1})$. We chose three datasets, ordered from small (i.e., PTB) to medium (i.e., WikiText-103) to large (i.e., One-Billion). Models are evaluated based on perplexity (PPL), which is the exponential of the average per-word negative log-probability. The lower the PPL, the better the model.
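To make the evaluation metric concrete, the sketch below computes a sentence log-probability from per-token conditional probabilities via the chain rule and converts it to perplexity, using the standard definition (exponential of the average negative log-probability per token). The probability values are made up for illustration and the function names are hypothetical.

```python
import math

def sentence_log_prob(cond_probs):
    """log p(s) = sum_i log p(w_i | w_1, ..., w_{i-1}) (chain rule)."""
    return sum(math.log(p) for p in cond_probs)

def perplexity(cond_probs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sentence_log_prob(cond_probs) / len(cond_probs))

# Hypothetical conditional probabilities assigned by a model to a 4-token sentence.
probs = [0.25, 0.5, 0.125, 0.5]
print(perplexity(probs))
```

A model that assigns probability 0.1 to every token has perplexity 10, which matches the intuition of choosing uniformly among 10 candidates at each step.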

Specifically, we take Transformer, the open-source state-of-the-art language modeling architecture, and replace the standard multi-head attention layers with our Multi-linear attention. Then, we test different model configurations on the PTB mikolov2011empirical , WikiText-103 merity2016pointer and One-Billion Word benchmark chelba2013one datasets and report the results in Table 1 and Table 2.

Table 1: Results (PPL) and model parameters compared with state-of-the-art results on One-Billion. Tensorized Transformer is our model. Core-1 means that the model uses a single block-term tensor; analogously, core-2 means that two block-term tensors are used. '-' indicates no reported result.

Model | Params | Test PPL
RNN-1024+9 Gram chelba2013one | 20B | 51.3
LSTM-2048-512 jozefowicz2016exploring | 0.83B | 43.7
GCNN-14 bottleneck dauphin2017language | - | 31.9
LSTM-8192-1024+CNN Input jozefowicz2016exploring | 1.04B | 30.0
High-Budget MoE shazeer2017outrageously | 5B | 28.0
LSTM+MoS yang2017breaking | 113M | 37.10
Transformer+adaptive input baevski2018adaptive | 0.46B | 23.7
Transformer-XL Base dai2019transformer | 0.46B | 23.5
Transformer-XL Large dai2019transformer | 0.8B | 21.8
Tensorized Transformer core-1 | 0.16B | 20.5
Tensorized Transformer core-2 | 0.16B | 19.5

Model | PTB Params | PTB Val PPL | PTB Test PPL | WikiText-103 Params | WikiText-103 Val PPL | WikiText-103 Test PPL
LSTM+augmented loss inan2016tying | 24M | 75.7 | 48.7 | - | - | 48.7
Variational RHN zoph2016neural | 23M | 67.9 | 65.4 | - | - | 45.2
4-layer QRNN merity2018analysis | - | - | - | 151M | - | 33.0
AWD-LSTM-MoS yang2017breaking | 22M | 58.08 | 55.97 | - | 29.0 | 29.2
Transformer+adaptive input baevski2018adaptive | 24M | 59.1 | 57 | 247M | 19.8 | 20.5
Transformer-XL Standard dai2019transformer | 24M | 56.72 | 54.52 | 151M | 23.1 | 24.0
Transformer-XL Large dai2019transformer | - | - | - | 257M | - | 18.3
Transformer-XL+TT khrulkov2019tensorized | 18M | 57.9* | 55.4* | 130M | 23.61* | 25.70*
Tensorized Transformer core-1 | 12M | 60.5 | 57.9 | 80.5M | 22.7 | 20.9
Tensorized Transformer core-2 | 12M | 54.25 | 49.8 | 86.5M | 19.7 | 18.9

Table 2: Results and compression compared with state-of-the-art results on PTB and WikiText-103. '-' indicates no reported result in that setting; '*' indicates that the result is from our own implementation.

5.2 Results and Details

PTB has 929k training tokens, 73k validation words, and 82k test words. The results are reported in Table 2. Similar to AWD-LSTM-MoS yang2017breaking , we apply variational dropout and weight averaging to our model (i.e., Tensorized Transformer). In addition, our model only replaces the multi-head attention with the Multi-linear attention structure, and the other structures remain the same. We compare the results of our model with those of other models. Our model achieves results comparable with SoTA when the number of core tensors is equal to two. Moreover, our model size (i.e., the number of model parameters) is reduced by nearly half compared with Transformer and Transformer-XL.

WikiText-103 contains 267,735 unique tokens. The dataset is an available word-level language modeling benchmark with long-term dependency. It contains 103M training tokens from 28k articles, with an average length of 3.6k tokens per article, which allows testing the ability to model long-term dependencies. Here, we set the sentence length to 100, which is different from the sentence length used for PTB (30) and One-Billion (30). As shown in Table 2, our model reduces the previous SoTA perplexity from 20.5 to 18.9, which demonstrates the effectiveness of the proposed attention architecture.

The One-Billion Word benchmark is a large dataset derived from a news site. The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words. In this dataset, sentences are shuffled and hence the context is limited. Consequently, this dataset mainly tests the ability to model short-term dependencies. The comparison between Tensorized Transformer and the other methods is shown in Table 1. Although Tensorized Transformer is mainly designed to better compress the Transformer or Transformer-XL model, it improves the single-model SoTA from 21.8 to 19.5. Specifically, Tensorized Transformer significantly outperforms a contemporary method using vanilla Transformers vaswani2017attention , suggesting that the advantage of the tensorized Transformer also generalizes to modeling short sequences.

Table 1 and Table 2 show that our model obtains a lower test PPL than the other models on the three datasets, while having far fewer parameters. On the One-Billion Word benchmark and the WikiText-103 dataset, we use the adaptive input method for the input layer, but not on the PTB dataset. Transformer-XL+TT khrulkov2019tensorized is a recent compression model that uses Tensor Train to compress the input embedding layers only. The results in Table 2 show that, compared with Transformer-XL+TT, our method has far fewer parameters and better language modeling performance. These results verify that our model (i.e., Multi-linear attention) is effective in language modeling tasks and performs well for model compression. Other details (such as hyperparameters and hardware) can be found in Supplementary Materials E.

5.3 Neural Machine Translation

The goal is to map an input sequence $s=(x_1,x_2,\ldots,x_n)$ representing a phrase in one language to an output sequence $y=(y_1,y_2,\ldots,y_m)$ representing the same phrase in a different language. In this task, we have trained the Transformer model vaswani2017attention on the WMT 2016 English-German dataset sennrich2016edinburgh . Sentences were tokenized using SentencePiece (https://github.com/google/sentencepiece). For our experiments, we have replaced each of the attention layers with Multi-linear attention. For evaluation we used beam search with a beam size of 5 and length penalty α=0.6. In this section, we only compare the results with Transformer vaswani2017attention . Our results are summarized in Table 3. * indicates that the result is from our own implementation.

In Table 3, we select two baseline models. The Base-line sennrich2016edinburgh is the first model on the WMT 2016 English-German dataset. For the other baseline, we use the basic Transformer architecture vaswani2017attention . The BLEU score is 34.5 for the basic architecture. We evaluate two tensorized Transformer structures, namely core-1 and core-2. Their BLEU scores are 34.10 and 34.91, respectively: core-2 outperforms Transformer, while core-1 is slightly below it. In terms of the reported model size, our model uses fewer parameters (21M and 21.2M versus 52M).

Table 3: Results and compression compared with Transformer on WMT-16 English-to-German translation. '-' indicates no reported result; '*' indicates that the result is from our own implementation.

Model | Params | BLEU
Base-line sennrich2016edinburgh | - | 26.8
Linguistic Input Featurec sennrich2016linguistic | - | 28.4
Attentional encoder-decoder + BPE sennrich2016edinburgh | - | 34.2
Transformer vaswani2017attention | 52M | 34.5*
Tensorized Transformer core-1 | 21M | 34.10
Tensorized Transformer core-2 | 21.2M | 34.91

6 Conclusion and Further Work

We have proposed a novel self-attention encoder layer, namely Multi-linear attention, to compress the original multi-head attention and derive a novel encoding scheme. Our main contribution is the Tensorized Transformer structure based on Block-Term tensor decomposition, which is represented by a combination of a group of 3-order tensors and adopts the ideas of low-rank approximation and parameter sharing. Compared with existing Transformer-based methods, our model achieves a higher compression ratio and better experimental results, particularly in the language modeling task. This evidence suggests that our method can potentially be applied to more NLP tasks with limited resources.

In the future, we will continue to optimize the Tensorized Transformer framework and apply it to other NLP tasks. As stated earlier, our model may suffer from overfitting when the number of cores is large in language modeling. We will explore the fundamental reasons for this problem and tackle them within the Tensorized Transformer framework.

References

  • [1] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.
  • [2] Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM, 2006.
  • [3] J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika, 35(3):283–319, 1970.
  • [4] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Computer Science, 2013.
  • [5] Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. Groupreduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998, 2018.
  • [6] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016.
  • [7] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  • [8] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 933–941. JMLR.org, 2017.
  • [9] Lieven De Lathauwer. Decompositions of a higher-order tensor in block terms—part ii: Definitions and uniqueness. SIAM Journal on Matrix Analysis and Applications, 30(3):1033–1066, 2008.
  • [10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. Published at ICLR2019, 2018.
  • [11] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. 2018.
  • [13] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.
  • [14] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • [15] Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.
  • [16] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
  • [17] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • [18] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
  • [19] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
  • [20] Guangxi Li, Jinmian Ye, Haiqin Yang, Di Chen, Shuicheng Yan, and Zenglin Xu. Bt-nets: simplifying deep neural networks via block term decomposition. arXiv preprint arXiv:1712.05689, 2017.
  • [21] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.
  • [22] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
  • [23] Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černockỳ. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
  • [24] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in neural information processing systems, pages 442–450, 2015.
  • [25] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
  • [26] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.
  • [27] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
  • [28] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6655–6659. IEEE, 2013.
  • [29] Rico Sennrich and Barry Haddow. Linguistic input features improve neural machine translation. arXiv preprint arXiv:1606.02892, 2016.
  • [30] Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891, 2016.
  • [31] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • [32] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
  • [33] Ehsan Variani, Ananda Theertha Suresh, and Mitchel Weintraub. West: Word encoded sequence transducers. arXiv preprint arXiv:1811.08417, 2018.
  • [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [35] Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3891–3900. JMLR.org, 2017.
  • [36] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.
  • [37] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9378–9387, 2018.
  • [38] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • [39] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.

Appendix A Tensor and Tensor Slice

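As introduced in [39], a tensor and the tensor slice can be defined as follows.

Definition 1 (tensor).

Let $D_1, D_2, \ldots, D_N \in \mathbb{N}$ denote index upper bounds. A tensor $\mathcal{A} \in \mathbb{R}^{D_1 \times \cdots \times D_N}$ of order $N$ is an $N$-way array whose elements $\mathcal{A}_{d_1,d_2,\ldots,d_N}$ are indexed by $d_n \in \{1,2,\ldots,D_n\}$ for $1 \le n \le N$.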

The concept of tensor slice is specified as:

Definition 2 (tensor slice).

A tensor slice is a two-dimensional section (fragment) of a tensor, obtained by fixing all indexes except for two indexes.
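As an illustration of Definitions 1 and 2, the following NumPy sketch builds a small 3-order tensor and takes one slice per mode. The shape and the variable names (following the common horizontal/lateral/frontal naming) are illustrative assumptions; note that NumPy indices start at 0, whereas the definition above starts at 1.

```python
import numpy as np

# A 3-order tensor with index upper bounds D1 = 2, D2 = 3, D3 = 4.
A = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Fixing all indices except two leaves a matrix (a tensor slice).
horizontal = A[0, :, :]   # first index fixed  -> 3 x 4 matrix
lateral = A[:, 1, :]      # second index fixed -> 2 x 4 matrix
frontal = A[:, :, 2]      # third index fixed  -> 2 x 3 matrix
```

For a tensor of order $N > 3$, a slice is obtained in the same way by fixing $N-2$ indices.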

Appendix B Theorem

Let $\boldsymbol{e}_1,\ldots,\boldsymbol{e}_n$ be basis vectors from the vector space $S$. Assume that these vectors are linearly independent. The output of the self-attention function in Eq. 2 (in the paper) can be represented by a linear combination of these basis vectors:

$$\mathrm{Attention}(Q,K,V)=(\boldsymbol{e}_1,\ldots,\boldsymbol{e}_n)M, \tag{9}$$

where $M \in \mathbb{R}^{n\times d}$ is a coefficient matrix, and $d$ is the dimension of the matrices $Q$, $K$, and $V$.

Proof.

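If $Q, K, V \in \mathrm{Span}(\boldsymbol{e}_1,\ldots,\boldsymbol{e}_n)$, the matrices $Q$, $K$ and $V$ can be written as linear combinations as follows:

$$\begin{cases} Q=(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)(\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2,\ldots,\boldsymbol{\alpha}_d) \\ K=(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)(\boldsymbol{\beta}_1,\boldsymbol{\beta}_2,\ldots,\boldsymbol{\beta}_d) \\ V=(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)(\boldsymbol{\xi}_1,\boldsymbol{\xi}_2,\ldots,\boldsymbol{\xi}_d) \end{cases} \tag{10}$$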

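The self-attention function is written as follows [34]:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \tag{11}$$

where $QK^{T}$ can be computed as follows:

$$QK^{T}=(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)(\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2,\ldots,\boldsymbol{\alpha}_d)(\boldsymbol{\beta}_1,\boldsymbol{\beta}_2,\ldots,\boldsymbol{\beta}_d)^{T}(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)^{T} \tag{12}$$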

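As a result, the input of the softmax function is a product of the coefficient matrices $(\boldsymbol{\alpha}_1,\ldots,\boldsymbol{\alpha}_d)$ and $(\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_d)^{T}$. Then, we have

$$\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)=(\boldsymbol{e}_1,\ldots,\boldsymbol{e}_n)\,\mathrm{softmax}\left(\frac{A}{\sqrt{d}}\right)(\boldsymbol{e}_1,\ldots,\boldsymbol{e}_n)^{T} \tag{13}$$

where the matrix $A$ is equal to $(\boldsymbol{\alpha}_1,\ldots,\boldsymbol{\alpha}_d)(\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_d)^{T}$. Therefore, the attention representation can be written as follows:

$$\begin{aligned} \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V &=(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)\,\mathrm{softmax}\left(\frac{A}{\sqrt{d}}\right)(\boldsymbol{\xi}_1,\boldsymbol{\xi}_2,\ldots,\boldsymbol{\xi}_d) \\ &=(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)M \end{aligned} \tag{14}$$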

where the matrix $M$ is equal to $\mathrm{softmax}(A/\sqrt{d})(\boldsymbol{\xi}_1,\boldsymbol{\xi}_2,\ldots,\boldsymbol{\xi}_d)$. The term $\mathrm{softmax}(A/\sqrt{d})$ normalizes the coefficient matrices of $Q$ and $K$. It follows that the output of the attention function [34] can be represented by a linear combination of the set of basis vectors. ∎

After the proof, it is helpful to describe the basic idea. First, we consider that the self-attention function can be linearly represented by a set of orthogonal basis vectors when the input of the softmax function is the product of two coefficient matrices, $(\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2,\ldots,\boldsymbol{\alpha}_d)$ and $(\boldsymbol{\beta}_1,\boldsymbol{\beta}_2,\ldots,\boldsymbol{\beta}_d)^{T}$. Second, in constructing the multi-head mechanism, the matrix of basis vectors $(\boldsymbol{e}_1,\boldsymbol{e}_2,\ldots,\boldsymbol{e}_n)$ can be shared.

Appendix C Corollary

Under the same conditions as in Theorem 3.1, and assuming that the elements in each row of the matrix $V$ are the same, the Single-block attention representation in Eq. 5 (in the paper) can reconstruct the Scaled Dot-Product attention in Eq. 2 (in the paper) by summing the tensor (i.e., the output of the Single-block attention function) over its second index. It holds that:

$$\mathrm{Attention}(Q,K,V)_{i,m}=\sum_{j=1}^{d}\mathrm{Atten}_{TD}(\mathcal{G};Q,K,V)_{i,j,m}, \tag{15}$$

where $i$, $j$ and $m$ are the indices of the Single-block attention output (i.e., a 3-order tensor), and $d$ is the dimension of the second index. $\mathrm{Atten}_{TD}(\cdot)$ is the Single-block attention function based on Tucker decomposition. On the left-hand side, $i$ and $m$ are the indices of the output (i.e., a matrix) of Eq. 2 (in the paper).

Proof.

In Theorem 3.1, we proved that the attention function can be represented by a linear combination of basis vectors. Therefore, we can represent the self-attention function in Eq. 2 (in the paper) in the following form:

$$\mathrm{Attention}(Q,K,V)=\Theta QK^{T}V \tag{16}$$

where $\Theta$ is a normalization factor matrix, which replaces the softmax function. We assume that $\Theta$ contains all the non-zero elements of the core tensor $\mathcal{G}$. The self-attention in Eq. 2 (in the paper) can be rewritten as follows:

$$X_{i,m}=\sum_{k=1}^{N}\sum_{r=1}^{R}\Theta_{i,m}Q_{i,r}K_{k,r}V_{k,m} \tag{17}$$

where $N$ is the length of a sentence, $X_{i,m}=\mathrm{Attention}(Q,K,V)_{i,m}$ is an entry of the output of the self-attention, and $R$ is equal to $d$. Here the core tensor $\mathcal{G}$ is the same as that in Eq. 7 (in the paper). Then, the Single-block attention (a 3-order tensor) can be represented as follows:

$$\mathcal{A}_{i,j,m}=\sum_{p=1}^{R}\sum_{q=1}^{R}\sum_{r=1}^{R}\mathcal{G}_{p,q,r}Q_{i,p}K_{j,q}V_{m,r} \tag{18}$$

where $\mathcal{A}$ is a 3-order tensor, which is equal to $\mathrm{Atten}_{TD}(\mathcal{G};Q,K,V)$. Accordingly, $\mathcal{A}_{i,j,m}$ is an entry of the tensor $\mathcal{A}$ and is equal to $\mathrm{Atten}_{TD}(\mathcal{G};Q,K,V)_{i,j,m}$ in Eq. 15. To prove Eq. 15, we need to establish the relation between Eq. 18 and Eq. 17. Since the core tensor $\mathcal{G}$ is a special tensor (i.e., a diagonal tensor), only the terms with $p=q=r$ are non-zero, and Eq. 18 can be written as follows:

$$\mathcal{A}_{i,j,m}=\sum_{r=1}^{R}\mathcal{G}_{r,r,r}Q_{i,r}K_{j,r}V_{m,r} \tag{19}$$

After that, we can compute the attention representation by summing over the second mode, i.e., index $j$ in Eq. 19 (denoted $k$ in Eq. 17). For better understanding, we give a graphical representation in Figure 3.

Figure 3: Left: the 3-order tensor $\mathcal{A}$, which represents the Single-block attention; $\mathcal{A}_{i,j,k}$ is an entry of $\mathcal{A}$. Right: the sum of the tensor slices obtained by splitting the tensor along index $j$. This graph illustrates the main content of Corollary 1.

$$X_{i,m}=\sum_{r=1}^{R}\sum_{j=1}^{N}\mathcal{G}_{r,r,r}Q_{i,r}K_{j,r}V_{m,r}$$

The corollary then holds. ∎
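The index manipulation in the proof can be checked numerically. The sketch below, with illustrative sizes and random inputs (all assumptions), verifies only two steps: that a Tucker contraction with a diagonal core reduces to the single sum of Eq. 19, and that summing the resulting 3-order tensor over its second index gives the final expression for $X_{i,m}$. It does not test the equivalence with softmax attention.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 4, 3  # sequence length and rank (illustrative values)
Q, K, V = (rng.standard_normal((N, R)) for _ in range(3))
g = rng.random(R)  # diagonal entries G[r, r, r] of the core tensor

# Dense core tensor that is zero everywhere except on its diagonal.
G = np.zeros((R, R, R))
G[np.arange(R), np.arange(R), np.arange(R)] = g

# Tucker contraction with the dense core (the triple sum of Eq. 18).
A_full = np.einsum("pqr,ip,jq,mr->ijm", G, Q, K, V)
# Diagonal form with a single sum over r (Eq. 19).
A_diag = np.einsum("r,ir,jr,mr->ijm", g, Q, K, V)

# Summing the tensor slices over the second index j yields a matrix.
X_from_slices = A_full.sum(axis=1)
X_direct = np.einsum("r,ir,jr,mr->im", g, Q, K, V)
```

Because the core is diagonal, the dense tensor `G` never needs to be stored in practice; the vector `g` of length $R$ carries all of its parameters.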

Appendix D Compression Ratio about Multi-Linear Attention

To compute the compression ratio, we compare multi-linear attention with multi-head attention. The comparison is shown in Figure 4.

Figure 4: A comparison of the parameters of multi-linear attention and multi-head attention.

In Figure 4, each Linear function in multi-head attention corresponds to a weight matrix $W \in \mathbb{R}^{d_{model}\times d}$, and all weight matrices in multi-head attention are different. In multi-linear attention, three weight matrices and $h$ weight vectors are used. From Figure 4, the compression ratio is computed as follows.

$$\text{compression ratio}=\frac{3\times h\times d_{model}\times d}{3\times d_{model}\times d+h\times d_{model}}=\frac{3\times h\times d}{3\times d+h} \tag{20}$$

In practice, $h$ is equal to 8 and $d$ is equal to 512, so the compression ratio is approximately 8. In our work, the dimension of the vector $\mathcal{G}_r$ is set to $R$, which is smaller than $d_{model}$, where $d_{model}$ is the dimension of the word vector.
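The arithmetic of Eq. 20 can be reproduced with a few lines of Python. The function name is hypothetical, and $d_{model}=512$ is an assumed value; it cancels out of the ratio, so it does not affect the result.

```python
def compression_ratio(h: int, d_model: int, d: int) -> float:
    """Eq. 20: multi-head parameter count over multi-linear parameter count."""
    multi_head = 3 * h * d_model * d               # h heads, three d_model x d matrices each
    multi_linear = 3 * d_model * d + h * d_model   # three shared matrices plus h weight vectors
    return multi_head / multi_linear


# h = 8 and d = 512 as stated in the text; d_model = 512 is an assumption.
ratio = compression_ratio(h=8, d_model=512, d=512)
simplified = 3 * 8 * 512 / (3 * 512 + 8)  # right-hand side of Eq. 20
```

Since $h$ is much smaller than $3d$, the ratio stays just below $h$, which is why it is close to 8 here.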

Low-rank Approximation for Model Compression. In this paper, we have described that our method combines two compression ideas, namely low-rank approximation and parameter sharing. Parameter sharing can be understood through the description of Figure 4. In Multi-linear attention, the idea of low-rank decomposition also serves to compress the model. We have proved that the Single-block attention can reconstruct a one-head self-attention in Transformer. To obtain the representation of a tensorized attention, we adopt the tensor splitting and the concat function. We then consider that each tensor slice from the tensor splitting approximates the output of the self-attention function in Eq. 2 (in the paper). When we focus only on the idea of low-rank approximation, the compression ratio can be computed as $\frac{N\times d}{N\times N}$, where $N$ is the length of a sequence and $d$ is the dimension of a matrix (also called the hidden size). Normally, $N$ is smaller than $d$.

By combining the ideas of parameter sharing and low-rank approximation, and formally considering the rank $R$, the compression ratio of the Multi-linear attention model can be computed as follows:

$$\text{compression ratio}_R=\frac{3\times h\times d_{model}\times d}{3\times d_{model}\times d+R\times h}, \tag{21}$$

where $R$ is the rank of the core tensor $\mathcal{G}$. The compression ratio becomes larger as $R$ becomes smaller. Here $\text{compression ratio}_R$ denotes the compression ratio associated with $R$. The rank $R$ needs to be set in practice. In experiments, $R$ can be set to 18, which is smaller than $d_{model}$.
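To see how the rank affects Eq. 21, the following sketch evaluates the ratio for a few ranks. The function name is hypothetical; $h=8$, $d=512$ and $R=18$ come from the text, while $d_{model}=512$ and the other two ranks are assumptions chosen for comparison.

```python
def compression_ratio_rank(h: int, d_model: int, d: int, rank: int) -> float:
    """Eq. 21: compression ratio when each core contributes `rank` parameters."""
    return (3 * h * d_model * d) / (3 * d_model * d + rank * h)


# Evaluate the ratio for decreasing ranks with the other sizes held fixed.
ratios = {
    rank: compression_ratio_rank(h=8, d_model=512, d=512, rank=rank)
    for rank in (512, 64, 18)
}
```

The ratio grows as the rank shrinks, but the gain is small once $R \times h$ is negligible next to $3 \times d_{model} \times d$; the upper bound is $h$.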

Appendix E Experiment

E.1 Partial Structure about Tensorized Transformer

In this paper, multi-linear attention is proposed. To show how multi-linear attention is incorporated into Transformer, Figure 5 illustrates the structure.

E.2 Experimental Details in Language Modeling

Figure 5: A diagram of how multi-linear attention is incorporated into a partial Transformer structure. The parameters are shared when constructing each single-block attention.

We now report some experimental details as supplementary material. First, we use three weight matrices $W_q$, $W_k$ and $W_v$ to linearly project the queries, keys and values. The outputs of the linear projections are shared $h$ times, where $h$ is the number of core tensors in our setting (i.e., core-1 ($h=1$), core-2 ($h=2$)). We use Block-Term Tensor Decomposition (BTD) to construct a new representation, namely Multi-linear attention, which is a 3-order tensor. To incorporate the proposed attention into the architecture of Transformer, we split the 3-order tensor and then concatenate the matrices from the tensor. For other layers, we use the same structure as the vanilla Transformer.

Hardware

We trained our model on one machine with 2 NVIDIA P40 GPUs. For our base models, the hyperparameters are described in Table 4. In addition, we set dropout=0.3 on all datasets. The model is trained for 30 epochs on each of the three datasets (PTB, WikiText-103 and One-Billion).

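Table 4: The hyperparameters of the Tensorized Transformer model.

| Datasets | $d_{head}$ | $d_{ff}$ | $h$ | $L$ | $d_k$ | $d_v$ | $R$ | Test PPL |
|---|---|---|---|---|---|---|---|---|
| PTB | 512 | 1024 | 2 | 6 | 40 | 40 | 10 | 49.8 |
| WikiText-103 | 512 | 1024 | 2 | 6 | 100 | 100 | 18 | 18.9 |
| One-Billion | 1024 | 2000 | 2 | 6 | 40 | 40 | 18 | 19.5 |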

Optimizer. We used the Adam optimizer and varied the learning rate over the course of training, following the formula in [34]. We also used warmup_steps=4000. Label smoothing is employed with the value $\epsilon=0.1$.
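The learning-rate formula of [34] is $lrate = d_{model}^{-0.5}\cdot\min(step^{-0.5},\ step\cdot warmup\_steps^{-1.5})$. A minimal sketch follows; the function name is hypothetical and $d_{model}=512$ is an assumed value, while warmup_steps=4000 is taken from the text.

```python
def warmup_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate of [34]: linear warmup, then inverse square-root decay (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Sample the schedule before, at, and after the end of the warmup phase.
schedule = {step: warmup_lr(step) for step in (1000, 2000, 4000, 8000, 16000)}
peak_step = max(schedule, key=schedule.get)
```

The rate grows linearly during the first warmup_steps updates, peaks at step 4000, and then decays proportionally to the inverse square root of the step number.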

E.3 Experiment Details in Neural Machine Translation

The Tensorized Transformer has also been applied to the Neural Machine Translation task. In this experiment, we use the same setup as Transformer [34] and replace the multi-head attention with the proposed multi-linear attention in the encoder structure. In the decoder structure, we still use the multi-head attention, in order to verify the effectiveness of encoding a sentence. The model is trained on one NVIDIA P40 GPU.

Appendix F Partial Code

The project is implemented in PyTorch. In this section, we give the partial code for our methods, i.e., Single-block attention and Multi-linear attention. First, the class of Single-block attention is given as follows.

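import torch
import torch.nn as nn
import torch.nn.init as init

class SingleBlockAttention(nn.Module):
    '''Single block attention with a diagonal core of rank R.'''
    def __init__(self, Rank):
        super(SingleBlockAttention, self).__init__()
        self.R = Rank
        # Learnable diagonal entries of the core tensor, registered once so they are trained.
        self.core = nn.Parameter(torch.rand(self.R))
    def forward(self, q, k, v, mb_size, d):
        # q, k, v: (mb_size, N, d). The diagonal fill below needs R <= min(N, d).
        N = v.size(1)
        # Normalize the core entries so that they sum to one.
        core = torch.softmax(self.core, dim=0)
        # Dense core of shape (d, d, N), as required by the einsum below;
        # only its first R diagonal entries are non-zero.
        core_tensor = torch.zeros(d, d, N, device=q.device, dtype=q.dtype)
        idx = torch.arange(self.R, device=q.device)
        core_tensor[idx, idx, idx] = core.to(q.dtype)
        full_matrixs = []
        for i in range(mb_size):
            # 3-order tensor for sample i, indexed by (query position, key position, feature).
            full_matrix_1 = torch.einsum('pqk,ip,jq,kr->ijr', core_tensor, q[i], k[i], v[i]).contiguous()
            # Sum the tensor slices over the second index.
            full_matrixs.append(torch.sum(full_matrix_1, dim=1))
        output = torch.stack(full_matrixs).float()
        return output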

Each Single block attention is a component of Multi-linear attention. Based on the Single block attention, the Multi-linear attention can be given as follows.

class MultiLinearAttention(nn.Module):
    ''' MultiLinearAttention '''
    def __init__(self, h, Rank, d, d_model, dropout=0.1):
        # d_model (the input feature size) is passed explicitly.
        super(MultiLinearAttention, self).__init__()
        self.n_head = h # h is equal to 2 in our model
        self.d_k = d
        self.d_v = d
        # Projection matrices shared by all h single-block attentions.
        self.w_q = nn.Parameter(torch.empty(d_model, d))
        self.w_k = nn.Parameter(torch.empty(d_model, d))
        self.w_v = nn.Parameter(torch.empty(d_model, d))
        # One single-block attention (with its own core) per block.
        self.Tattention = nn.ModuleList([SingleBlockAttention(Rank) for _ in range(h)])
        self.layer_norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d, d_model)
        self.dropout = nn.Dropout(dropout)
        init.xavier_normal_(self.w_q)
        init.xavier_normal_(self.w_k)
        init.xavier_normal_(self.w_v)
    def forward(self, q, k, v):
        d_v = self.d_v
        residual = q
        mb_size = q.size(0)
        # Shared linear projections: (mb_size, len, d_model) -> (mb_size, len, d).
        q_s = torch.matmul(q, self.w_q)
        k_s = torch.matmul(k, self.w_k)
        v_s = torch.matmul(v, self.w_v)
        # Average the outputs of the h single-block attentions.
        outputs = sum(att(q_s, k_s, v_s, mb_size, d_v) for att in self.Tattention) / self.n_head
        # project back to residual size
        outputs = self.proj(outputs)
        outputs = self.dropout(outputs)
        return self.layer_norm(outputs + residual)