MetaFormer is Actually What You Need for Vision

Abstract

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, rather than the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator that conducts only the most basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing the well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 48%/60% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and motivates us to introduce the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design. Code is available at https://github.com/sail-sg/poolformer
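To make the central idea concrete, the following is a minimal sketch of a block whose token mixer is a pluggable function, with parameter-free average pooling as one choice. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify the pooling window, normalization, or channel MLP, so the 3x3 window, the omission of normalization, and the fixed-scale `channel_mlp` stand-in are all hypothetical, as are the names `pool_mixer` and `metaformer_block`.

```python
from typing import Callable, List

# grid[row][col] is one token's channel vector (assumed layout for this sketch)
Grid = List[List[List[float]]]


def pool_mixer(grid: Grid, size: int = 3) -> Grid:
    """Mix tokens by averaging each one with its in-bounds spatial neighbours.

    No learned parameters; the 3x3 default window is an assumption.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    r = size // 2
    out: Grid = []
    for i in range(h):
        row = []
        for j in range(w):
            neigh = [grid[a][b]
                     for a in range(max(0, i - r), min(h, i + r + 1))
                     for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append([sum(t[k] for t in neigh) / len(neigh) for k in range(c)])
        out.append(row)
    return out


def channel_mlp(token: List[float], scale: float = 0.5) -> List[float]:
    """Hypothetical stand-in for a learned per-token MLP: ReLU, then a fixed scale."""
    return [scale * max(0.0, v) for v in token]


def metaformer_block(grid: Grid, token_mixer: Callable[[Grid], Grid]) -> Grid:
    """Residual token mixing followed by a residual per-token channel MLP.

    Normalization layers are omitted to keep the sketch short.
    """
    mixed = token_mixer(grid)
    grid = [[[x + m for x, m in zip(tok, mtok)]
             for tok, mtok in zip(row, mrow)]
            for row, mrow in zip(grid, mixed)]
    return [[[x + y for x, y in zip(tok, channel_mlp(tok))]
             for tok in row]
            for row in grid]


# A 2x2 grid of single-channel tokens; any Grid -> Grid function can be the mixer.
tokens = [[[1.0], [3.0]], [[5.0], [7.0]]]
pooled_out = metaformer_block(tokens, pool_mixer)
```

Because the block only requires a function from grid to grid, swapping `pool_mixer` for attention or a spatial MLP leaves the rest of the block unchanged, which is the sense in which the general architecture is separated from the specific token mixer.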
