A ConvNet for the 2020s

Abstract

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
