Abstract
Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at https://github.com/SHI-Labs/OneFormer.
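
To make the task-conditioned training and the query-text contrastive loss more concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the released implementation: the function names (`sample_task`, `query_text_contrastive_loss`), the uniform task sampling, the temperature value of 0.07, and the symmetric InfoNCE-style form of the loss are all hypothetical choices made for this example. The sketch assumes that the i-th query embedding should match the i-th text embedding.

```python
import math
import random

# The three segmentation tasks named in the abstract.
TASKS = ("semantic", "instance", "panoptic")


def sample_task(rng):
    """Pick the task for one training sample (uniform sampling is an assumption)."""
    return rng.choice(TASKS)


def _normalize(vector):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def query_text_contrastive_loss(queries, texts, temperature=0.07):
    """Symmetric contrastive loss between paired query and text embeddings.

    Assumes queries[i] and texts[i] form the positive pair; every other
    pairing in the batch is treated as a negative.
    """
    q = [_normalize(v) for v in queries]
    t = [_normalize(v) for v in texts]
    # Cosine similarity of every query with every text, scaled by temperature.
    logits_q2t = [
        [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        for qi in q
    ]
    # Transpose to obtain the text-to-query direction.
    logits_t2q = [list(column) for column in zip(*logits_q2t)]
    return _cross_entropy_diagonal(logits_q2t) + _cross_entropy_diagonal(logits_t2q)


if __name__ == "__main__":
    rng = random.Random(0)
    print("sampled task:", sample_task(rng))
    aligned = query_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
    swapped = query_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
    print("aligned loss:", aligned, "swapped loss:", swapped)
```

In this sketch the loss is small when each query embedding points in the same direction as its paired text embedding and large when the pairs are swapped, which is the behavior a contrastive objective needs in order to separate tasks and classes.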