Abstract
Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have been prominent in the majority of medical image segmentation applications over the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations, which the decoder can use for semantic output prediction. Despite their success, the locality of convolutional layers in FCNNs limits their capability to learn long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr
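
To make the sequence-to-sequence reformulation concrete, the sketch below shows one plausible way to turn a 3D volume into a 1D sequence of flattened, non-overlapping patches that a transformer encoder could consume. It is an illustration only, not the released implementation: the function name `volume_to_patch_sequence`, the channels-last layout, and the patch size used in the example are assumptions that the abstract does not specify.

```python
import numpy as np


def volume_to_patch_sequence(volume, patch):
    """Flatten a 3D volume into a sequence of non-overlapping cubic patches.

    Hypothetical helper for illustration; layout and patch size are assumptions.
    volume: array of shape (H, W, D, C), channels last.
    patch:  edge length of each cubic patch; must divide H, W and D.
    Returns an array of shape (N, patch**3 * C) with N = (H*W*D) / patch**3.
    """
    h, w, d, c = volume.shape
    if h % patch or w % patch or d % patch:
        raise ValueError("patch size must divide every spatial dimension")

    gh, gw, gd = h // patch, w // patch, d // patch
    # Split each spatial axis into (grid index, offset inside the patch).
    x = volume.reshape(gh, patch, gw, patch, gd, patch, c)
    # Bring the three grid indices to the front, the in-patch offsets after.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch: this is the token sequence fed to the encoder.
    return x.reshape(gh * gw * gd, patch * patch * patch * c)


# Usage: a toy 4x4x4 single-channel volume split into 2x2x2 patches.
toy = np.arange(64, dtype=np.float32).reshape(4, 4, 4, 1)
tokens = volume_to_patch_sequence(toy, 2)
print(tokens.shape)
```

In a full model, each row would typically be passed through a learned linear projection and combined with a positional embedding before entering the transformer; those steps are omitted here because the abstract does not describe them.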