Which transformer architecture fits my data? A vocabulary bottleneck in self-attention

Abstract

After their successful debut in natural language processing, Transformer architectures are now becoming the de-facto standard in many domains. An obstacle for their deployment over new modalities is the architectural configuration: the optimal depth-to-width ratio has been shown to dramatically vary across data types (e.g., $10$x larger over images than over language). We theoretically predict the existence of an embedding rank bottleneck that limits the contribution of self-attention width to the Transformer expressivity. We thus directly tie the input vocabulary size and rank to the optimal depth-to-width ratio, since a small vocabulary size or rank dictates an added advantage of depth over width. We empirically demonstrate the existence of this bottleneck and its implications on the depth-to-width interplay of Transformer architectures, linking the architecture variability across domains to the often glossed-over usage of different vocabulary sizes or embedding ranks in different domains. As an additional benefit, our rank bottlenecking framework allows us to identify size redundancies of $25\%-50\%$ in leading NLP models such as ALBERT and T5.
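
The linear-algebra fact underlying an embedding rank bottleneck can be illustrated with a minimal NumPy sketch. It is not the paper's analysis or experimental setup: the function names, the random Gaussian matrices, and the sizes used below are all hypothetical and chosen only for illustration. The sketch checks that an embedding matrix of shape (vocabulary size, width) has rank at most the smaller of the two dimensions, and that a factorized embedding with inner dimension r has rank at most r, regardless of how wide the network is.

```python
import numpy as np


def embedding_rank(vocab_size: int, width: int, seed: int = 0) -> int:
    """Numerical rank of a random (vocab_size x width) embedding matrix.

    Illustrative assumption: entries are i.i.d. Gaussian, standing in for a
    learned embedding table. The rank cannot exceed min(vocab_size, width).
    """
    rng = np.random.default_rng(seed)
    embedding = rng.standard_normal((vocab_size, width))
    return int(np.linalg.matrix_rank(embedding))


def factorized_embedding_rank(
    vocab_size: int, inner_rank: int, width: int, seed: int = 0
) -> int:
    """Numerical rank of a factorized embedding (V x r) @ (r x width).

    Illustrative assumption: both factors are random Gaussian matrices.
    The product has rank at most inner_rank, however large width is.
    """
    rng = np.random.default_rng(seed)
    factor_in = rng.standard_normal((vocab_size, inner_rank))
    factor_out = rng.standard_normal((inner_rank, width))
    return int(np.linalg.matrix_rank(factor_in @ factor_out))


if __name__ == "__main__":
    # Hypothetical sizes: a small vocabulary, a large one, and a factorized one.
    print("small vocabulary:", embedding_rank(vocab_size=16, width=64))
    print("large vocabulary:", embedding_rank(vocab_size=1000, width=64))
    print("factorized:", factorized_embedding_rank(1000, 8, 64))
```

In this toy setting, widening the model beyond the vocabulary size or the inner rank does not raise the rank of the embedded input, which is the kind of limit the abstract ties to the advantage of depth over width.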
