Abstract
Most recent semantic segmentation methods adopt a fully-convolutional network(FCN) with an encoder-decoder architecture. The encoder progressively reducesthe spatial resolution and learns more abstract/semantic visual concepts withlarger receptive fields. Since context modeling is critical for segmentation,the latest efforts have been focused on increasing the receptive field, througheither dilated/atrous convolutions or inserting attention modules. However, theencoder-decoder based FCN architecture remains unchanged. In this paper, we aimto provide an alternative perspective by treating semantic segmentation as asequence-to-sequence prediction task. Specifically, we deploy a puretransformer (ie, without convolution and resolution reduction) to encode animage as a sequence of patches. With the global context modeled in every layerof the transformer, this encoder can be combined with a simple decoder toprovide a powerful segmentation model, termed SEgmentation TRansformer (SETR).Extensive experiments show that SETR achieves new state of the art on ADE20K(50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results onCityscapes. Particularly, we achieve the first (44.42% mIoU) position in thehighly competitive ADE20K test server leaderboard.