Transformer Meets Twicing: Harnessing Unattended Residual Information

Abstract

Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism, a core component of transformers, has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers, thereby hurting the model's overall performance. In this work, we leverage the connection between self-attention computations and low-pass non-local means (NLM) smoothing filters and propose Twicing Attention, a novel attention mechanism that uses the kernel twicing procedure from nonparametric regression to alleviate the low-pass behavior of the associated NLM smoothing, with compelling theoretical guarantees and enhanced adversarial robustness. This approach enables the extraction and reuse of meaningful information retained in the residuals following the imperfect smoothing operation at each layer. Our proposed method offers two key advantages over standard self-attention: 1) a provably slower decay of representational capacity and 2) improved robustness and accuracy across various data modalities and tasks. We empirically demonstrate the performance gains of our model over baseline transformers on multiple tasks and benchmarks, including image classification and language modeling, on both clean and corrupted data.
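The residual-reuse idea can be illustrated with a small sketch. It assumes the classical form of kernel twicing: smooth the values with the attention matrix A, smooth the leftover residual V - AV with the same A, and add the two, which is algebraically (2A - A^2)V. The function names, the single-head setup, and the plain-Python list representation are illustrative assumptions; the excerpt above does not specify the exact formulation or implementation used in the paper.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax_rows(scores):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_matrix(q, k):
    """Row-stochastic attention matrix A = softmax(Q K^T / sqrt(d))."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])


def standard_attention(q, k, v):
    """Standard self-attention output A V (a smoothing of the values)."""
    return matmul(attention_matrix(q, k), v)


def twicing_attention(q, k, v):
    """Hypothetical twicing step: A V + A (V - A V) = (2A - A^2) V."""
    a = attention_matrix(q, k)
    smoothed = matmul(a, v)
    # Residual: the part of V that the first smoothing pass did not capture.
    residual = [[vi - si for vi, si in zip(v_row, s_row)]
                for v_row, s_row in zip(v, smoothed)]
    # Smooth the residual with the same matrix and add it back.
    correction = matmul(a, residual)
    return [[s + c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(smoothed, correction)]
```

Under these assumptions, the twicing output differs from the standard output only by the smoothed residual term, so when every value row is identical (nothing is left in the residual) both functions return the input values unchanged.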
