Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

Abstract

This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
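
The distillation idea described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the frozen single-head attention "teacher" with random weights, the fixed sequence length, the flattened input, the layer sizes, and the function names (`teacher_attention`, `init_student`, `student_forward`, `distill`) are all assumptions made for illustration. A one-hidden-layer feed-forward "student" is trained by plain gradient descent to regress the teacher's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, D_MODEL, HIDDEN = 4, 8, 64  # illustrative sizes, not taken from the paper

# Frozen "teacher": a single-head self-attention layer with random weights
# (a stand-in for a trained attention layer; assumption for this sketch).
W_q, W_k, W_v = (rng.normal(0, D_MODEL ** -0.5, (D_MODEL, D_MODEL)) for _ in range(3))

def teacher_attention(x):
    """x: (batch, SEQ_LEN, D_MODEL) -> attention output of the same shape."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D_MODEL)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def init_student():
    """One-hidden-layer network over the flattened, fixed-length sequence."""
    n = SEQ_LEN * D_MODEL
    return {
        "W1": rng.normal(0, np.sqrt(2.0 / n), (n, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0, np.sqrt(1.0 / HIDDEN), (HIDDEN, n)),
        "b2": np.zeros(n),
    }

def student_forward(params, x_flat):
    h = np.maximum(0.0, x_flat @ params["W1"] + params["b1"])  # ReLU
    return h, h @ params["W2"] + params["b2"]

def distill(params, steps=300, batch=64, lr=0.2):
    """Train the student to match teacher outputs (MSE); returns loss history."""
    losses = []
    for _ in range(steps):
        x = rng.normal(size=(batch, SEQ_LEN, D_MODEL))
        target = teacher_attention(x).reshape(batch, -1)
        x_flat = x.reshape(batch, -1)
        h, pred = student_forward(params, x_flat)
        err = pred - target
        losses.append(float(np.mean(err ** 2)))
        # Manual backpropagation of the mean-squared error.
        g_pred = 2.0 * err / err.size
        g_W2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_h = g_pred @ params["W2"].T
        g_h[h <= 0.0] = 0.0  # ReLU gradient mask
        g_W1 = x_flat.T @ g_h
        g_b1 = g_h.sum(axis=0)
        for name, grad in (("W1", g_W1), ("b1", g_b1), ("W2", g_W2), ("b2", g_b2)):
            params[name] -= lr * grad
    return losses

params = init_student()
losses = distill(params)
print("first loss:", losses[0], "last loss:", losses[-1])
```

In this toy setting the student sees the whole flattened sequence at once, so it can in principle capture cross-position mixing without any attention operation; whether a shallow network matches a trained attention layer on real data is exactly the empirical question the paper studies.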
