DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

Abstract

Various dropout methods have been designed for the fully-connected, convolutional, and recurrent layers in neural networks, and have been shown to be effective at avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adapting. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
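
The abstract does not spell out how the attention weights are regularized. As a rough illustration only, the sketch below shows one way dropout could be applied to the attention weights of a single query: each weight is zeroed with some probability and the surviving weights are renormalized to sum to one. The function names (`drop_attention_weights`, `attend`), the renormalization step, and the fallback when every weight is dropped are all assumptions made for this example, not the paper's stated method.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def drop_attention_weights(weights, drop_prob, rng):
    """Zero each attention weight with probability drop_prob, then renormalize.

    Renormalizing so the row sums to 1 again is an assumption of this sketch;
    the abstract does not specify how dropped weights are compensated.
    """
    kept = [0.0 if rng.random() < drop_prob else w for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Every weight was dropped; fall back to the original row (assumption).
        return list(weights)
    return [w / total for w in kept]


def attend(query, keys, values, drop_prob=0.0, rng=None):
    """Scaled dot-product attention for one query, with optional weight dropping."""
    rng = rng or random.Random(0)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    if drop_prob > 0.0:
        # Applied at training time only; pass drop_prob=0.0 at inference.
        weights = drop_attention_weights(weights, drop_prob, rng)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

Dropping weights after the softmax, rather than dropping the output features, forces each output vector to rely on a changing subset of context positions, which is the kind of co-adaptation the abstract says the method targets.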
