Patches Are All You Need?

Abstract

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/locuslab/convmixer.
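The link between patch embeddings and the quadratic cost of self-attention can be made concrete with a little arithmetic. The following is a minimal sketch in plain Python, not the authors' code: the function names `num_tokens`, `attention_pairs`, and `patchify` are hypothetical, and the 224x224 image size and the patch sizes 1, 7, and 16 are illustrative assumptions that do not come from the abstract. It counts how many input features (tokens) an image yields for a given patch size, how many entries a single self-attention score matrix over those tokens would have, and shows one simple way to group each patch into a single flat feature vector.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patch_size x patch_size patches in an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(height, width, patch_size):
    """Entries in one self-attention score matrix: (number of tokens) squared."""
    n = num_tokens(height, width, patch_size)
    return n * n


def patchify(image, patch_size):
    """Group each patch_size x patch_size region into one flat feature vector.

    `image` is a nested list indexed as image[row][col][channel]
    (a hypothetical layout chosen for this sketch).
    """
    height, width = len(image), len(image[0])
    num_tokens(height, width, patch_size)  # raises if the size is not divisible
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    patch.extend(image[r][c])  # append all channels of this pixel
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # Illustrative sizes only: a 224x224 image with three assumed patch sizes.
    for p in (1, 7, 16):
        print(p, num_tokens(224, 224, p), attention_pairs(224, 224, p))
```

Under these assumed sizes, treating every pixel as a token (patch size 1) gives 50,176 tokens and roughly 2.5 billion score-matrix entries, whereas a patch size of 16 gives 196 tokens and 38,416 entries. This is the sense in which patches make self-attention practical on larger images.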
