Abstract
Attention mechanism is a central component of the transformer architecturewhich led to the phenomenal success of large language models. However, thetheoretical principles underlying the attention mechanism are poorlyunderstood, especially its nonconvex optimization dynamics. In this work, weexplore the seminal softmax-attention model $f(\boldsymbol{X})=\langle\boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where$\boldsymbol{X}$ is the token sequence and$(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are trainable parameters. Weprove that running gradient descent on $\boldsymbol{p}$, or equivalently$\boldsymbol{W}$, converges in direction to a max-margin solution thatseparates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearlyformalizes attention as an optimal token selection mechanism. Remarkably, ourresults are applicable to general data and precisely characterize$\textit{optimality}$ of tokens in terms of the value embeddings$\boldsymbol{Xv}$ and problem geometry. We also provide a broaderregularization path analysis that establishes the margin maximizing nature ofattention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditionsunder which the regularization paths directionally converge to their respectivehard-margin SVM solutions where $\boldsymbol{v}$ separates the input featuresbased on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, weverify our theoretical findings via numerical experiments and provide insights.