Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
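
The idea of constraining cross-attention to predicted mask regions can be illustrated with a minimal NumPy sketch. It is not the Mask2Former implementation: the function name `masked_cross_attention`, the single-head formulation, the scaling by the square root of the feature dimension, the thresholding of mask logits at zero, and the fallback for empty masks are all assumptions made for illustration.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask_logits):
    """Single-head cross-attention restricted to each query's predicted mask.

    queries:     (N, d)  one embedding per object query
    keys/values: (P, d)  one embedding per image location (pixel or patch)
    mask_logits: (N, P)  per-query mask prediction, assumed to be logits
    """
    # Assumption: a location is foreground when its logit is positive
    # (equivalent to sigmoid(logit) > 0.5).
    foreground = mask_logits > 0

    # Assumption: a query whose mask is empty attends everywhere,
    # so the softmax below is never taken over an all-masked row.
    empty = ~foreground.any(axis=1, keepdims=True)
    foreground = foreground | empty

    # Scaled dot-product scores; locations outside the mask are set to -inf
    # so they receive exactly zero attention weight.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = np.where(foreground, scores, -np.inf)

    # Numerically stable softmax over image locations.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(2, 4))   # 2 queries
    k = rng.normal(size=(6, 4))   # 6 image locations
    v = rng.normal(size=(6, 4))
    # Query 0 predicts a mask over the first two locations; query 1 predicts nothing.
    logits = np.array([[5.0, 5.0, -5.0, -5.0, -5.0, -5.0],
                       [-5.0] * 6])
    out, w = masked_cross_attention(q, k, v, logits)
    print(out.shape)
    print(np.round(w, 3))
```

In this sketch, query 0 draws features only from the two locations inside its predicted mask, which is the sense in which the extracted features are localized, while query 1 falls back to ordinary attention over all locations.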