A Survey of Visual Transformers

Abstract

Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to Computer Vision (CV) fields and demonstrated their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance on multiple benchmarks such as ImageNet, COCO, and ADE20k as compared with modern Convolutional Neural Networks (CNNs). In this paper, we provide a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy to organize these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and target tasks, we also evaluate these methods on different configurations for easy and intuitive comparison, rather than only across various benchmarks. Furthermore, we reveal a series of essential but unexploited aspects that may empower Transformer to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.
