LambdaNetworks: Modeling Long-Range Interactions Without Attention

Abstract

We present lambda layers -- an alternative framework to self-attention -- for capturing long-range interactions between an input and structured contextual information (e.g. a pixel surrounded by other pixels). Lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately. Similar to linear attention, lambda layers bypass expensive attention maps, but in contrast, they model both content and position-based interactions which enables their application to large structured inputs such as images. The resulting neural network architectures, LambdaNetworks, significantly outperform their convolutional and attentional counterparts on ImageNet classification, COCO object detection and COCO instance segmentation, while being more computationally efficient. Additionally, we design LambdaResNets, a family of hybrid architectures across different scales, that considerably improves the speed-accuracy tradeoff of image classification models. LambdaResNets reach excellent accuracies on ImageNet while being 3.2 - 4.4x faster than the popular EfficientNets on modern machine learning accelerators. When training with an additional 130M pseudo-labeled images, LambdaResNets achieve up to a 9.5x speed-up over the corresponding EfficientNet checkpoints.
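
The mechanism summarized above (contexts turned into linear functions, which are then applied to each input) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `lambda_layer`, the projection matrices `Wq`, `Wk`, `Wv`, the softmax normalization of the keys over context positions, and the dense position-embedding tensor `E` are all hypothetical choices made for this example, and the abstract does not specify any of them.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lambda_layer(X, C, Wq, Wk, Wv, E):
    """Sketch of a single lambda layer (all names are illustrative assumptions).

    X:  (n, d) inputs, one row per query position
    C:  (m, d) context elements
    Wq: (d, k), Wk: (d, k), Wv: (d, v) hypothetical projection matrices
    E:  (n, m, k) hypothetical position embeddings, one (m, k) slice per query
    Returns Y of shape (n, v).
    """
    Q = X @ Wq                     # (n, k) queries
    K = softmax(C @ Wk, axis=0)    # (m, k) keys, normalized over context positions
    V = C @ Wv                     # (m, v) values

    # Content lambda: one (k, v) linear map summarizing the whole context,
    # shared by every query position.
    content_lambda = K.T @ V

    # Position lambdas: one (k, v) linear map per query position.
    position_lambdas = np.einsum("nmk,mv->nkv", E, V)

    # Each query is transformed by its own linear function (its lambda).
    # No (n, m) attention map between queries and context is ever formed.
    lambdas = content_lambda[None, :, :] + position_lambdas   # (n, k, v)
    return np.einsum("nk,nkv->nv", Q, lambdas)

# Toy usage with random data; the sizes are arbitrary.
rng = np.random.default_rng(0)
n, m, d, k, v = 6, 6, 8, 4, 5
X = rng.normal(size=(n, d))
C = X                              # here the context is the input itself
Wq = rng.normal(size=(d, k))
Wk = rng.normal(size=(d, k))
Wv = rng.normal(size=(d, v))
E = rng.normal(size=(n, m, k))
Y = lambda_layer(X, C, Wq, Wk, Wv, E)
print(Y.shape)
```

Because the lambdas depend only on the context, the output is linear in the queries: scaling `Wq` by a constant scales `Y` by the same constant. The dense `E` tensor is used here purely for readability; it is not meant to reflect how position information is stored in practice.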
