Contextual Transformer Networks for Visual Recognition

Abstract

Transformers with self-attention have revolutionized the natural language processing field and have recently inspired Transformer-style architecture designs with competitive results in numerous computer vision tasks. Nevertheless, most existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix based on pairs of isolated queries and keys at each spatial location, leaving the rich contexts among neighboring keys under-exploited. In this work, we design a novel Transformer-style module, i.e., the Contextual Transformer (CoT) block, for visual recognition. This design fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, the CoT block first contextually encodes input keys via a $3\times3$ convolution, leading to a static contextual representation of the inputs. We further concatenate the encoded keys with the input queries to learn the dynamic multi-head attention matrix through two consecutive $1\times1$ convolutions. The learnt attention matrix is multiplied by the input values to achieve the dynamic contextual representation of the inputs. The fusion of the static and dynamic contextual representations is finally taken as the output. Our CoT block is appealing in that it can readily replace each $3\times3$ convolution in ResNet architectures, yielding a Transformer-style backbone named Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications (e.g., image recognition, object detection and instance segmentation), we validate the superiority of CoTNet as a stronger backbone. Source code is available at \url{https://github.com/JDAI-CV/CoTNet}.
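
The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released CoTNet implementation: it assumes a single head, a $1\times1$ convolution for the value embedding, a ReLU between the two $1\times1$ convolutions, a softmax over spatial positions, element-wise multiplication of attention and values, and fusion by simple addition. None of these choices is specified in the abstract, and all function and parameter names (`conv3x3`, `conv1x1`, `cot_block_sketch`, `w_key`, `w_val`, `w_att1`, `w_att2`) are hypothetical.

```python
import numpy as np

def conv3x3(x, w):
    """Zero-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            # Each kernel tap (dy, dx) sees a shifted window of the padded input.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             xp[:, dy:dy + h, dx:dx + wd])
    return out

def conv1x1(x, w):
    """1x1 convolution, i.e. a per-pixel linear map. w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def cot_block_sketch(x, w_key, w_val, w_att1, w_att2):
    """Simplified CoT-style block on one feature map x of shape (C, H, W)."""
    q = x                                   # queries: the input itself
    k1 = conv3x3(x, w_key)                  # static context from neighboring keys
    v = conv1x1(x, w_val)                   # values (assumed 1x1 embedding)
    y = np.concatenate([k1, q], axis=0)     # encoded keys + queries, (2C, H, W)
    hidden = np.maximum(conv1x1(y, w_att1), 0.0)  # first 1x1 conv + ReLU (assumed)
    att = conv1x1(hidden, w_att2)           # second 1x1 conv -> attention logits
    # Softmax over spatial positions per channel (assumed normalization).
    a = att.reshape(att.shape[0], -1)
    a = np.exp(a - a.max(axis=1, keepdims=True))
    a = (a / a.sum(axis=1, keepdims=True)).reshape(att.shape)
    k2 = a * v                              # dynamic context
    return k1 + k2                          # fusion by addition (assumed)

# Usage with random weights on a tiny 4-channel, 5x5 feature map.
rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
x = rng.standard_normal((C, H, W))
out = cot_block_sketch(
    x,
    w_key=rng.standard_normal((C, C, 3, 3)),
    w_val=rng.standard_normal((C, C)),
    w_att1=rng.standard_normal((C // 2, 2 * C)),
    w_att2=rng.standard_normal((C, C // 2)),
)
print(out.shape)
```

Because the output has the same shape as the input, a block of this form can stand in wherever a shape-preserving $3\times3$ convolution is used, which is the property the abstract relies on when replacing convolutions in ResNet.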
