Abstract
We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.