Pay Attention when Required

  • 2020-09-09 19:39:15
  • Swetha Mandava, Szymon Migacz, Alex Fit-Florea
  • 36

Abstract

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs in the choice and ordering of these blocks to improve upon the current Transformer architecture, and proposed the PAR Transformer. It needs 35% less compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, and it retains the perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
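
To make the block-replacement idea concrete, here is a minimal Python sketch. It is an illustration only, not the authors' search procedure or code. The names (`interleaved_pattern`, `replace_attention`, `relative_cost`), the 's'/'f' string encoding of blocks, the 16-layer baseline, the choice to replace the attention blocks nearest the output, and the per-block cost values are all assumptions made for this example. The abstract states only the ~63% replacement fraction and the 35% compute-time reduction.

```python
def interleaved_pattern(n_layers: int) -> str:
    """Baseline layout: each layer is one self-attention block ('s')
    followed by one feed-forward block ('f')."""
    return "sf" * n_layers


def replace_attention(pattern: str, fraction: float) -> str:
    """Swap a fraction of the 's' blocks for 'f' blocks.

    Assumption: the blocks closest to the output are replaced first.
    The abstract does not say which blocks are replaced.
    """
    blocks = list(pattern)
    attn_positions = [i for i, b in enumerate(blocks) if b == "s"]
    n_replace = round(len(attn_positions) * fraction)
    # Replace the last n_replace attention blocks (none if n_replace == 0).
    for i in attn_positions[len(attn_positions) - n_replace:]:
        blocks[i] = "f"
    return "".join(blocks)


def relative_cost(pattern: str, attn_cost: float = 2.0, ff_cost: float = 1.0) -> float:
    """Sum of per-block costs. The default costs are placeholders,
    not measurements from the paper."""
    return sum(attn_cost if b == "s" else ff_cost for b in pattern)


baseline = interleaved_pattern(16)          # hypothetical 16-layer model
sparse = replace_attention(baseline, 0.63)  # ~63% of attention blocks replaced

print(baseline)
print(sparse)
print(relative_cost(sparse) / relative_cost(baseline))
```

The total number of blocks is unchanged; only the mix of block types shifts. The printed cost ratio depends entirely on the placeholder costs, so it should not be read as a reproduction of the reported 35% figure.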

 
