Per-Pixel Classification is Not All You Need for Semantic Segmentation

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
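
To make the output format concrete, the following is a minimal sketch of how a set of binary masks, each with one global class distribution, could be collapsed into a per-pixel semantic map. The abstract does not specify an inference rule, so the combination used here (summing class probability times mask membership over all masks, then taking the best class per pixel) is an assumption, and the function name `masks_to_semantic_map` and the toy inputs are hypothetical.

```python
def masks_to_semantic_map(mask_probs, class_probs):
    """Combine N (mask, class-distribution) pairs into one label per pixel.

    mask_probs:  N masks, each an H x W nested list of values in [0, 1].
    class_probs: N lists of K class probabilities (one global label
                 distribution per mask).

    Assumed rule (not taken from the abstract): the score of class c at a
    pixel is the sum over masks of p_i[c] * m_i[y][x].
    """
    num_classes = len(class_probs[0])
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Accumulate each class's score across all predicted masks.
            scores = [
                sum(p[c] * m[y][x] for m, p in zip(mask_probs, class_probs))
                for c in range(num_classes)
            ]
            # Pick the index of the highest-scoring class for this pixel.
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy example: two masks over a 2 x 2 image, three classes.
masks = [
    [[1.0, 1.0], [0.0, 0.0]],  # mask 0 covers the top row
    [[0.0, 0.0], [1.0, 1.0]],  # mask 1 covers the bottom row
]
classes = [
    [0.1, 0.8, 0.1],  # mask 0 is most likely class 1
    [0.7, 0.2, 0.1],  # mask 1 is most likely class 0
]
semantic_map = masks_to_semantic_map(masks, classes)
```

In this sketch the class decision is made once per mask rather than once per pixel, which is the distinction the abstract draws between mask classification and per-pixel classification.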
