On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

Abstract

Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components, such as self-attention mechanisms and feedforward networks, without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.
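The conditioning claim can be illustrated with a minimal numerical sketch. This is not the paper's experimental setup: the sizes, the random input X, and the weights W_q, W_k, W_v are all assumptions chosen for illustration. Setting the query and key weights to zero gives the extreme case of uniform attention, where the softmax matrix has rank one; the value weights are set to the identity so that the effect of the residual connection on the singular values is easy to see.

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_output(X, W_q, W_k, W_v, residual):
    # Single-head self-attention output, optionally with a residual connection.
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(X.shape[1])
    out = softmax_rows(scores) @ X @ W_v
    return X + out if residual else out

# Illustrative sizes and weights (assumptions, not taken from the paper).
n = d = 4
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W_q = np.zeros((d, d))  # zero scores -> uniform attention, a rank-one softmax matrix
W_k = np.zeros((d, d))
W_v = np.eye(d)

# Singular values are returned in descending order.
sv_x = np.linalg.svd(X, compute_uv=False)
sv_plain = np.linalg.svd(
    attention_output(X, W_q, W_k, W_v, residual=False), compute_uv=False
)
sv_resid = np.linalg.svd(
    attention_output(X, W_q, W_k, W_v, residual=True), compute_uv=False
)

print("singular values of X:                   ", sv_x)
print("attention output without residual:      ", sv_plain)
print("attention output with residual:         ", sv_resid)
```

In this degenerate case every row of the attention output without the residual is the same averaged token, so all but one singular value vanish up to rounding error. With the residual, the output equals (I + S) X, where S is the uniform attention matrix, so its smallest singular value is no smaller than that of X. Trained attention weights are not this extreme, but the sketch shows the mechanism the abstract describes.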
