### Abstract

Variants dropout methods have been designed for the fully-connected layer,convolutional layer and recurrent layer in neural networks, and shown to beeffective to avoid overfitting. As an appealing alternative to recurrent andconvolutional layers, the fully-connected self-attention layer surprisinglylacks a specific dropout method. This paper explores the possibility ofregularizing the attention weights in Transformers to prevent differentcontextualized feature vectors from co-adaption. Experiments on a wide range oftasks show that DropAttention can improve performance and reduce overfitting.

### Quick Read (beta)

# DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

###### Abstract

Variants dropout methods have been designed for the fully-connected layer, convolutional layer and recurrent layer in neural networks, and shown to be effective to avoid overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adaption. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.

DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

Lin Zehui
Fudan University
[email protected]
Pengfei Liu ^{†}^{†}thanks: Co-mentoring
Fudan University
[email protected]
Luyao Huang
Fudan University
[email protected]
Jie Fu
MILA
[email protected]
Junkun Chen
Fudan University
[email protected]
Xipeng Qiu ^{†}^{†}thanks: Corresponding author
Fudan University
[email protected]
Xuanjing Huang
Fudan University
[email protected]

noticebox[b]Preprint. Under review.\[email protected]

## 1 Introduction

As an effective and easy-to-implement regularization method, Dropout has been first designed for fully-connected layers in neural models Srivastava et al. (2014). Over the past few years, a host of variants of dropout have been introduced. For recurrent neural networks (RNNs), dropout is only applied to the input layers before the successful attempt in Krueger et al. (2016); Semeniuta et al. (2016); Gal and Ghahramani (2016). Also, a dozen of dropout methods for convolutional neural networks (CNNs) have been proposed in Tompson et al. (2015); Huang et al. (2016); Larsson et al. (2016); Gastaldi (2017); Ghiasi et al. (2018); Zoph et al. (2018); Yamada et al. (2018). On the other hand, fully-connected self-attention neural networks, such as Transformers Vaswani et al. (2017), have emerged as a very appealing alternative to RNNs and CNNs when dealing with sequence modelling tasks.

Although Transformers incorporate dropout operators in their architecture, the regularization effect of dropout in the self-attention has not been thoroughly analyzed in the literature.

The success of the adaption of dropout for fully-connected, convolutional and recurrent layers gives us a tantalizing hint that a more specific dropout for self-attentional operators might be needed.
Additionally, the original publicly code^{1}^{1}
1
https://github.com/tensorflow/tensor2tensor of Transformer Vaswani et al. (2017) with Dropout trick also provides the evidence for this work, although it’s less understood why it works and how it might be extended.
In this paper, we demonstrate the benefit of dropout in self-attention layers (DropAttention) with two key distinctions compared with the dropout used in fully-connected layers and recurrent layers.
The first is that DropAttention randomly sets attention weights to zero, which can be interpreted as dropping a set of neurons along different dimensions.
Specifically, DropAttention aims to encourage the model to utilize the full context of the input sequences rather than relying solely on a small piece of features.
For example, for sentiment classification, the prediction is usually dominated by one or several emotional words, ignoring other informative patterns. This can make the model overfit some specific patterns.
In fully-connected and recurrent layers, dropout discourages the complex co-adaptation of different units in the same layer, while DropAttention prevents different contextualized feature vectors from co-adapting, learning features which are generally helpful for task-specific prediction.
Secondly, in addition to dropping out individual attentional units, we also explore the possibility of operating in contiguous regions.
It is inspired by DropBlock Ghiasi et al. (2018) where units in a contiguous region of a convolutional feature map are discarded together. It is a more effective way of dropping for attention layers, since a semantic unit are usually composed of several spatially consecutive words.
Experiments on a wide range of tasks with different-scale datasets show that DropAttention can improve performance and reduce overfitting.

## 2 Related Work

Methods | Dropped Objects | Spaces | Layers |
---|---|---|---|

Dropout Srivastava et al. (2014) | Unit | Hidden | FCN |

DropConnect Wan et al. (2013) | Weight | Hidden | FCN |

SpatialDropout Kalchbrenner et al. (2014) | Unit | Hidden | CNN |

Cutout DeVries and Taylor (2017) | Unit | Input | CNN |

DropEmb Gal and Ghahramani (2016) | Weight | Input | Lookup |

Variational Dropout (Gal and Ghahramani, 2016) | Unit | Hidden | RNN |

Zoneout Krueger et al. (2016) | Unit | Hidden | RNN |

DropBlock Ghiasi et al. (2018) | Region of Units | Hidden | CNN |

DropAttention | Region of Weights | Input& Hidden | Self-Attention |

We present a summary of existing models by highlighting differences among dropped object, spaces and layers as shown in Table 1. The original idea of Dropout is proposed by Srivastava et al. (2014) for fully-connected networks, which is regarded as an effective regularization method. After that, many dropout techniques for specific network architectures, such as CNNs and RNNs, have been proposed. For CNNs, most successful methods require the noise to be structured Tompson et al. (2015); Huang et al. (2016); Larsson et al. (2016); Gastaldi (2017); Ghiasi et al. (2018); Zoph et al. (2018); Yamada et al. (2018). For example, SpatialDropoutKalchbrenner et al. (2014) is used to address the spatial correlation problem. DropConnect Wan et al. (2013) sets a randomly selected subset of weights within the network to zero. For RNNs, Variational Dropout Gal and Ghahramani (2016) and ZoneOut Krueger et al. (2016) are most widely used methods. In Variational Dropout, dropout rate is learned and the same neurons are dropped at every timestep. In ZoneOut, it stochastically forces some hidden units to maintain their previous values instead of dropping. Different from these methods, in this paper, we explore how to drop information on self-attention layers.

## 3 Background

### 3.1 Transformer Architecture

The typical fully-connected self-attention architecture is the Transformer Vaswani et al. (2017), which uses the scaled dot-product attention to model the intra-interactions of a sequence. Given a sequence of vectors $H\in {\mathbb{R}}^{l\times d}$, where $l$ and $d$ represent the length of and the dimension of the input sequence, the self-attention projects $H$ into three different matrices: the query matrix $Q$, the key matrix $K$ and the value matrix vector $V$, and uses scaled dot-product attention to get the output representation.

$Q,K,V=H{W}^{Q},H{W}^{K},H{W}^{V}$ | (1) | ||

$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}({\displaystyle \frac{Q{K}^{T}}{\sqrt{{d}_{k}}}})V,$ | (2) |

where ${W}^{Q},{W}^{K},{W}^{V}\in {\mathbb{R}}^{d\times {d}_{k}}$ are learnable parameters and $\mathrm{softmax}()$ is performed row-wise.

To enhance the ability of self-attention, multi-head self-attention is introduced as an extension of the single head self-attention, which jointly model the multiple interactions from different representation spaces,

$\mathrm{MultiHead}(H)=[{\text{head}}_{1};\mathrm{\dots};{\text{head}}_{k}]{W}^{O},$ | (3) | ||

$\mathrm{where}\mathit{\hspace{1em}}{\text{head}}_{i}=\mathrm{Attn}(H{W}_{i}^{Q},H{W}_{i}^{K},H{W}_{i}^{V}),$ | (4) |

where ${W}^{O},{W}_{i}^{Q},{W}_{i}^{K},{W}_{i}^{V}(i\in [1,h])$ are learnable parameters. Transformer consists of several stacked multi-head self-attention layers and fully-connected layers. Assuming the input of the self-attention layer is $H$, its output $\stackrel{~}{H}$ is calculated by

$Z=$ | $H+\mathrm{MultiHead}(\text{layer-norm}(H)),$ | (5) | ||

$\stackrel{~}{H}=$ | $Z+\mathrm{MLP}(\text{layer-norm}(Z)),$ | (6) |

where $\mathrm{layer}-\mathrm{norm}(\cdot )$ represents the layer normalization Ba et al. (2016) .

Besides, since the self-attention ignores the order information of a sequence, a positional embedding $PE$ is used to represent the positional information.

### 3.2 Dropout

Dropout Srivastava et al. (2014) is a popular regularization form for fully-connected neural network. It breaks up co-adaptation between units, therefore it can significantly reduce overfitting and improve test performance. Each unit of a hidden layer ${\mathbf{h}}^{(l)}\in {\mathbb{R}}^{d}$ is dropped with probability $p$ by setting it to 0.

${\mathbf{h}}^{(l+1)}=f(W\mathbf{m}\odot {\mathbf{h}}^{(l)}),$ | (7) |

where $\mathbf{m}\in {\{0,1\}}^{d}$ is a binary mask vector with each element $j$ drawn independently from ${m}_{j}\sim Bernoulli(p)$, and $\odot $ denotes element-wise production.

DropConnect Wan et al. (2013) is a generalization of Dropout. It randomly drops hidden layers weights instead of units. Assume $M$ is a binary mask matrix drawn from ${M}_{ij}\sim Bernoulli(p)$, $W$ is the hidden layer weights. Then DropConnect can be formulated as,

${\mathbf{h}}^{(l+1)}=f((W\odot M){\mathbf{h}}^{(l)})$ | (8) |

Dropout essentially drops the entire column of the weight matrix. Therefore, Dropout can be regarded as a special case of DropConnect, where a whole column weight is dropped.

Since Dropout and DropConnect achieve great success on fully-connected layer, a natural motivation is whether a specific dropout method is needed for the fully-connected self-attention networks. Experiments conducted shows that a new dropout method designed for fully-connected self-attention networks can also reduce overfitting and obtain improvements.

## 4 DropAttention

In this section, we will introduce our attention regularization method: DropAttention.

Given a sequence of vectors $H\in {\mathbb{R}}^{l\times d}$, the fully-connected self-attention layer can be reformulated into

$\stackrel{~}{H}=f(\mathrm{\Lambda}V),$ | (9) |

where $\mathrm{\Lambda}=\mathrm{softmax}(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}})$, $f(\cdot )$ is a residual nonlinear function defined by Eq. (6) and $Q,K,V$ is calculated by Eq. (1).

The output of $i$-th position is

${\stackrel{~}{\mathbf{h}}}_{i}=f({\displaystyle \sum _{j=1}^{l}}{\lambda}_{ij}{\mathbf{v}}_{j}),$ | (10) |

where ${\stackrel{~}{\mathbf{h}}}_{i}$ is the $i$-th row vector of $\stackrel{~}{H}$ and ${\mathbf{v}}_{j}$ is the $j$-th row vector of $V$. ${\lambda}_{ij}$ is the entry of $\mathrm{\Lambda}$.

With this formulation, we can connect the self-attention layer to the fully-connected layer with two differences. The first difference is the weight matrix $\mathrm{\Lambda}$ is dynamically generated. The second difference is that the basic unit is a vector rather than a neuron.

Due to the similarity between fully-connected layer and self-attention layer, we can introduce the popular dropout methods for FCN to self-attention mechanism. In detail, we propose two dropout methods for the fully-connected self-attention layer: DropAttention(c) and DropAttention(e).

1) DropAttention(c) means to randomly drop “column” in attention weight matrix, which is a simple method similar to the standard Dropout Srivastava et al. (2014). We randomly drop the unit ${\mathbf{v}}_{j},1\le j\le l$. Note that ${\mathbf{v}}_{j}$ here is a vector instead of a single neuron.

2) DropAttention(e) means to randomly drop “element” in attention weight matrix, which is a more generalized method of the DropAttention(c). Similar to DropConnect Wan et al. (2013), DropAttention(e) randomly drops elements in attention weights matrix $\mathrm{\Lambda}$. DropAttention(c) can be regarded as a special case of DropAttention(e) in which a whole column of $\mathrm{\Lambda}$ is dropped.

Besides the basic dropping strategies, we also augment the DropAttentions with two functions.

### 4.1 Dropping Contiguous Region

Inspired by DropBlock Ghiasi et al. (2018), we drop contiguous region of the attention weights matrix instead of independent random units. The behind motivation is based on distributional hypothesis Harris (1954): words that are used and occur in the same contexts tend to purport similar meanings. In Transformer Vaswani et al. (2017) where multi-layer structure is used, when dropping independent random units, information correlated with the dropped input can still be restored in the next layer through surrounding words, which may cause the networks overfitting. Dropping the whole semantic unit consisting of several words can be a more effective way of dropping out.

Therefore, there are two hyperparameters in DropAttention: window size $w$ and drop rate $p$. The window size $w$ is the length of contiguous region to be dropped, while $p$ controls how many units to drop. In standard Dropout Srivastava et al. (2014), the binary mask is sampled from the Bernoulli distribution with the probability of $p$. Since DropAttention will expand every zero entry in the binary mask to be window with size $w$. Therefore, we just require to drop $p/w$ windows.

### 4.2 Normalized Rescaling

To ensure that the sum of attention weights to remain 1 after applying DropAttention, we re-normalize the attention weights after dropout operations. While traditional Dropout also has rescaling operation where neuron weights are divided by $1-p$, there is no guarantee that the sum of attention weights after rescaling remains 1. Experiments on classification task (see sec. 5.1) show that DropAttention with normalized rescaling outperforms traditional dropout rescaling. And in practice with normalized rescale, training process can be more steady compared to traditional rescaling.

Figure 1 shows two proposed DropAttention methods. The Pseudocode of DropAttention(e) is described in Algorithm 4.2. DropAttention(c) is adopted in the similar way to DropAttention(e).

[t] \[email protected]@algorithmic[1] \RequireAttention weight matrix $\mathrm{\Lambda}$, window size $w$, drop rate $p$ \IfInference \Statereturn $\mathrm{\Lambda}$ \EndIf\State$\gamma =p/w$; \StateSample mask matrix $M$ randomly, where ${M}_{ij}\sim Bernoulli(\gamma )$; \StateFor each zero position ${M}_{ij}$, expand the mask with the span length of $w$, ${M}_{i,j}:{M}_{i,j+w-1}$, and set all the values in the window to be $0$; \StateApply mask: $\mathrm{\Lambda}=M\odot \mathrm{\Lambda}$; \ForAllrow vector of $\mathrm{\Lambda}$: ${\lambda}_{j}$ \StateNormalized rescale: ${\lambda}_{j}={\lambda}_{j}/sum({\lambda}_{j})$ \EndFor

## 5 Experiment

We evaluate the effectiveness of DropAttentions on 4 different tasks: Text Classification, Sequence Labeling, Textual Entailment and Machine Translation. We also conduct a set of analytical experiments to validate properties of the networks.

### 5.1 Text Classification

We first evaluate the effectiveness of DropAttention on a couple of classification datasets ranging from small, medium and large scale. Statistics of datasets are listed in Table 2. All datasets are split into training, development and testing sets.

Dataset | #classes | #documents |

CR | 2 | 3,993 |

QC | 6 | 5,052 |

SUBJ | 2 | 10,000 |

MR | 2 | 10,661 |

AG’s News | 4 | 127,600 |

Yelp2013 | 5 | 335,018 |

Yelp13 reviews: collected from the Yelp Dataset Challenge in 2013, which have 5 levels of ratings from 1 to 5. We use the same Yelp datasets slitted and tokenized in Tang et al. (2015). MR: Movie reviews with two classes Pang and Lee (2005). SUBJ: Subjectivity dataset containing subjective and objective instance. It is also a 2 classes dataset Pang and Lee (2004). CR: Customer reviews of various products with positive and negative sentiments. AG’s News: A news topic classification with 4 classes created by Zhang et al. (2015). QC: The TREC questions dataset involves six different question types Li and Roth (2002).

Model | Norm? | CR | SUBJ | MR | QC | AG’s News | Yelp13 |

w/o DropAttention | 80.00 | 93.30 | 76.92 | 88.40 | 88.13 | 61.49 | |

p=0.4,w=2 | p=0.2,w=3 | p=0.3,w=2 | p=0.3 w=1 | p=0.4 w=1 | p=0.4 w=1 | ||

DropAttention(c) | Y | 82.75 | 94.10 | 78.80 | 90.80 | 88.87 | 62.34 |

N | 78.25 | 93.10 | 77.30 | 89.60 | 88.49 | 62.27 | |

p=0.2,w=3 | p=0.3,w=2 | p=0.3,w=2 | p=0.3,w=2 | p=0.2,w=2 | p=0.2 w=1 | ||

DropAttention(e) | Y | 81.25 | 93.50 | 78.51 | 89.60 | 88.66 | 61.79 |

N | 81.25 | 93.50 | 75.33 | 88.80 | 88.47 | 61.46 |

Detail model configurations are given in Appendix. We use accuracy as evaluation metrics. Results of all datasets are listed in Table 3. It shows that DropAttentions can significantly improve performance on a wide range of datasets of small, medium and large scale. Besides, note that when comparing normalized rescaling with traditional rescaling under the same DropAttention hyperparameters, Table 3 shows that normalized rescaling can generally obtain better performance.

For classification tasks, we find that larger dropout rate and smaller window size are preferred for DropAttention(c) while smaller dropout rate and larger window size are preferred for DropAttention(e). And DropAttention(c) can generally obtain higher performances than DropAttention(e) in classification tasks.

### 5.2 Sequence Labeling

We also evaluate the effectiveness of DropAttention on sequence labeling. We conducted experiments by following the same settings as Yang et al. (2016). We use the following benchmark datasets in our experiments: Penn Treebank (PTB) POS tagging, CoNLL 2000 chunking, CoNLL 2003 English NER. The statistics of the datasets are described in Table 4.

Dataset | Task | Train | Dev. | Test |
---|---|---|---|---|

CoNLL 2000 | Chunking | 211,727 | - | 47,377 |

CoNLL 2003 | NER | 204,567 | 51,578 | 46,666 |

PTB | POS | 912,344 | 131,768 | 129,654 |

We process sentences with Transformer encoder. After encoding, we feed the output vector into a fully-connected layer. Detail model hyperparameters are given in Appendix.

Results are shown in Table 6. In Table 6, all best results are under the hyperparameters of $p=0.3,w=3$ except for DropAttention(e) in POS task with $p=0.2,w=2$. It shows that both DropAttention(c) and DropAttention(e) can obtain significant improvements. Our model achieve 0.29 accuracy, 0.40 F1 score, 1.76 F1 score improvements in POS, NER and Chunking respectively. And we find that larger dropout rate and larger window size are generally preferred.

Transformer | POS | NER | Chunking |
---|---|---|---|

w/o DropAttention | 95.92 | 87.23 | 89.09 |

p=0.3 w=3 | p=0.3 w=3 | p=0.3 w=3 | |

DropAttention(c) | 96.21 | 88.51 | 90.56 |

p=0.2,w=2 | p=0.3,w=3 | p=0.3,w=3 | |

DropAttention(e) | 96.17 | 88.63 | 90.85 |

Model | HyperParam | BLEU | |
---|---|---|---|

p | w | ||

w/o DropAttention | 0 | 0 | 27.30 |

DropAttention(c) | 0.1 | 1 | 27.96 |

0.1 | 2 | 27.87 | |

0.1 | 3 | 27.98 | |

0.2 | 1 | 27.87 | |

0.2 | 2 | 28.04 | |

0.2 | 3 | 27.95 | |

DropAttention(e) | 0.1 | 1 | 28.16 |

0.1 | 2 | 28.03 | |

0.1 | 3 | 28.07 | |

0.2 | 1 | 27.92 | |

0.2 | 2 | 28.32 | |

0.2 | 3 | 27.87 |

### 5.3 Textual Entailment

We use the biggest textual entailment dataset, SNLI Bowman et al. (2015) corpus to evaluate the effectiveness of DropAttention on this task. SNLI is a collection of sentence pairs labeled for entailment, contradiction, and semantic independence. A pair of sentences called premise and hypothesis will be fed to the model, and the model will be asked to tell the relation of two sentences. It is also a classification task, and we measure the performance by accuracy.

We process the hypothesis and premise with the same Transformer encoder, which means that the hypothesis encoder and the premise encoder share the same parameters. We use max pooling to create a simple vector representation from the output of transformer encoder. After processing two sentences respectively, we use the two outputs to construct the final feature vector, which consisting of the concatenation of two sentence vectors, their difference, and their elementwise product Bowman et al. (2016). We then feed the final feature vector into a 2-layer ReLU MLP to map the hidden representation into classification result. Detail model hyperparameters are given in Appendix.

Results are listed in Table 8. For full results with different hyperparameters please refer to Appendix. Experiments show that DropAttention(c) and DropAttention(e) can significantly improve performances.

### 5.4 Machine Translation

We further demonstrate the effectiveness of DropAttention on translation tasks. We conduct experiments on WMT’ 16 En-De dataset which consists of 4.5M sentence pairs. We follow Ott et al. (2018) by reusing the preprocessed data, where Ott et al. (2018) validates on newstest13 and tests on newstest14, and uses a vocabulary of 32K symbols based on a joint source and target byte pair encoding (BPE; Sennrich et al. (2015)). We measure case-sensitive tokenized BLEU.
We use the fairseq-py toolkit ^{2}^{2}
2
https://github.com/pytorch/fairseq re-implementation of Transformer Vaswani et al. (2017) model. We follow the configuration of original Transformer base model Vaswani et al. (2017). See detail configuration in Appendix. DropAttention with different hyperparameters is applied to attention weights.

Table 6 shows the BLEU score for DropAttention with different hyperparameters. The results show that DropAttention can generally obtain higher performance compared with baseline without DropAttention. With DropAttention(e) of $p=0.2,w=2$, the model can outperform the baseline by a large margin, reaching a BLEU score of 28.32. For DropAttention(c), the model also reaches the best BLEU score with $p=0.2,w=2$.

There are two insights from this experiment. The first is that a regularization of self-attention works to improve the generalization ability even for the large-scale data. The second is that the DropAttention is complementary to the standard dropout.

### 5.5 Complementarity to stardard Dropout

We also explore the effect of DropAttention combining with standard Dropout. We conduct experiments on classification tasks and machine translation tasks. We choose AG’s News as classification dataset and WMT’ 16 En-De as Machine Translation dataset. Same hyperparameters as 5.1 and 5.4 are used. Table 8 shows that when combining DropAttention with Dropout, models can obtain higher performances compared to implementing Dropout or DropAttention alone. It implys that DropAttention is complementary to stardard Dropout.

Transformer | SNLI |
---|---|

w/o DropAttention | 83.36 |

p=0.2 w=3 | |

DropAttention(c) | 84.38 |

p=0.5,w=1 | |

DropAttention(e) | 84.48 |

Transformer | Classification | MT |
---|---|---|

baseline | 88.13 | 25.42 |

+ Standard Dropout | 88.43 | 27.3 |

+ DropAttention | 88.50 | 26.3 |

+ Dropout+DropAttention | 88.70 | 28.32 |

## 6 Analysis

In this section, we study the impact of DropAttention on the behavior of model quantitatively. We use three metrics to evaluate the model based on the attention weights: Div, Disagreement and Entropy.

Div Suppose $A$ is the attention weights matrix, where every row i corresponds to the attention weights vector produced by the ${i}_{th}$ attention head. Div is defined as,

$$\text{Div}={\parallel \left(A{A}^{T}-I\right)\parallel}_{F}^{2},$$ | (11) |

where $\parallel \cdot {\parallel}_{F}$ represents the Frobenius norm of a matrix and $I$ stands for identity matrix. It was first introduced by Lin et al. (2017) as a penalization term which encourages the diversity of weight vectors across different heads of attention. If Div gets large, it means multi-heads attention weights distributions have large overlap.

Disagreement We use the same notations above. ${A}^{i}$ stands for the ${i}_{th}$ row of the attention matrix, then the Disagreement is expressed as,

$$\text{Disagreement}=\frac{1}{{h}^{2}}\sum _{i=1}^{h}\sum _{j=1}^{h}\frac{{A}^{i}\cdot {A}^{j}}{\parallel {A}^{i}\parallel \parallel {A}^{j}\parallel},$$ | (12) |

where $h$ denotes the number of heads. It was proposed by Li et al. (2018), which also expects to encourage the diversity of the model. The Disagreement is defined as calculating the cosine similarity $\mathrm{cos}(\cdot )$ between the attention weights vector pair produced by two different heads. The smaller score is, the more diverse different attention heads are.

Entropy is used to evaluate the diversity within one head. ${A}_{j}^{i}$ is the ${j}_{th}$ element of the attention weights vector produced by ${i}_{th}$ head. Entropy of attention weights is defined as,

$${\text{E}}_{i}=-\sum _{j}{A}_{j}^{i}\mathrm{log}{A}_{j}^{i}.$$ | (13) |

If entropy gets small, it represents that the head focus on a small fraction of words.

### 6.1 Effect on Intra-Diversity

We first observe the impact of DropAttention on intra-diversity, namely attention distribution within one head. Figure (a)a shows the multi-head entropy of models for classification task. When the drop rate and window size increasing, the entropy increase accordingly. This suggests DropAttention can effectively smoothen the attention distribution, making the model utilize more context. This can subsequently increase robustness of the model.

### 6.2 Effect on Inter-Diversity

We further study the impact of DropAttention on inter-diversity, namely the difference between multi heads. Figure (b)b and (c)c show the Disagreement and Diveristy of multi heads, respectively. It shows that with larger drop rate and window size, Div and Disagreement are larger accordingly. Note that large Diversity and Disagreement means that the difference of attention distribution between heads is small. This is due to the smoother attention distribution within one head. With less sharply different multi-heads, the model does not have to rely on a single head to make predictions, which means that all heads have a smoother contribution to the final predictions. This can increase robustness of the model.

### 6.3 Effect on Sparsity

Similar to Srivastava et al. (2014), we also observe the effect of DropAttention on sparsity. Since the attention weights are summed up to 1, we only collect the largest attention weights of all heads. To eliminate the effect of sentence length, attention weights are multiplied by the sentence length. Figure 3 shows the distribution of largest attention weights, where model with DropAttention has smaller attention weights compared to model without DropAttention. This phenomenon is consistent with Srivastava et al. (2014) where model with dropout tends to allocate smaller activation weights compared to model without dropout.

## 7 Conclusion and Discussion

In this paper, we introduce DropAttention, a variant of Dropout designed for fully-connected self-attention network. Experiments on a wide range of tasks demonstrate that DropAttention is an effective technique for improving generalization and reducing overfitting of self-attention networks. Several analytical statistics give the intuitive impacts of DropAttention, which show that applying DropAttention can help model utilize more context, subsequently increasing robustness.

## References

- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
- Bowman et al. (2016) Samuel R Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021, 2016.
- DeVries and Taylor (2017) Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027, 2016.
- Gastaldi (2017) Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
- Ghiasi et al. (2018) Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10750–10760, 2018.
- Harris (1954) Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
- Huang et al. (2016) Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European conference on computer vision, pages 646–661. Springer, 2016.
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014.
- Krueger et al. (2016) David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
- Larsson et al. (2016) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
- Li and Roth (2002) Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pages 1–7. Association for Computational Linguistics, 2002.
- Li et al. (2018) Jian Li, Zhaopeng Tu, Baosong Yang, Michael R Lyu, and Tong Zhang. Multi-head attention with disagreement regularization. arXiv preprint arXiv:1810.10183, 2018.
- Lin et al. (2017) Zhouhan Lin, Mo Feng, Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
- Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. CoRR, abs/1806.00187, 2018.
- Pang and Lee (2004) Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics, 2004.
- Pang and Lee (2005) Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 115–124. Association for Computational Linguistics, 2005.
- Semeniuta et al. (2016) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Tang et al. (2015) Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1422–1432, 2015.
- Tompson et al. (2015) Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
- Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
- Yamada et al. (2018) Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba, and Koichi Kise. Shakedrop regularization for deep residual learning. arXiv preprint arXiv:1802.02375, 2018.
- Yang et al. (2016) Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270, 2016.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.
- Zoph et al. (2018) Barret Zoph, Vijay VasudSentiment analysis using subjectivity summarizationevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.