TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation

Abstract

Unsupervised semantic segmentation aims to obtain high-level semantic representations from low-level visual features without manual annotations. Most existing methods are bottom-up approaches that group pixels into regions based on their visual cues or on predefined rules. As a result, these bottom-up approaches struggle to produce fine-grained semantic segmentation in complicated scenes that contain multiple objects, some of which share a similar visual appearance. In contrast, we propose the first top-down unsupervised semantic segmentation framework for fine-grained segmentation in extremely complicated scenarios. Specifically, we first obtain rich, high-level, structured semantic concept information from large-scale vision data in a self-supervised manner, and use this information as a prior to discover the potential semantic categories present in the target datasets. Second, the discovered high-level semantic categories are mapped to low-level pixel features by calculating the class activation map (CAM) with respect to a given discovered semantic representation. Lastly, the obtained CAMs serve as pseudo labels to train the segmentation module and produce the final semantic segmentation. Experimental results on multiple semantic segmentation benchmarks show that our top-down unsupervised segmentation is robust on both object-centric and scene-centric datasets under different levels of semantic granularity, and outperforms all current state-of-the-art bottom-up methods. Our code is available at https://github.com/damo-cv/TransFGU.
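
To make the second and third steps concrete, the following is a minimal sketch of how per-class activation maps can be turned into pixel-level pseudo labels. It is not the authors' implementation: as a simplifying assumption, the activation map is approximated by the cosine similarity between each pixel feature and a per-class prototype vector, rather than by a true CAM computation. The function name `pseudo_labels_from_prototypes` and the array layouts are hypothetical, chosen only for illustration.

```python
import numpy as np


def pseudo_labels_from_prototypes(pixel_feats, prototypes):
    """Assign each pixel to its most activated class (illustrative stand-in for CAM).

    pixel_feats: array of shape (H, W, D), one D-dim feature per pixel.
    prototypes:  array of shape (K, D), one vector per discovered class
                 (hypothetical input; how classes are discovered is not shown here).
    Returns (activations of shape (K, H, W), labels of shape (H, W)).
    """
    h, w, d = pixel_feats.shape
    flat = pixel_feats.reshape(-1, d)

    # L2-normalise both sides so the dot product is a cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    protos = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)

    # One activation map per class: similarity of every pixel to that class.
    activations = protos @ flat.T  # shape (K, H*W)

    # Pseudo label = index of the class with the highest activation at each pixel.
    labels = activations.argmax(axis=0).reshape(h, w)
    return activations.reshape(-1, h, w), labels
```

In a full pipeline, label maps like these would then supervise a separate segmentation module, as the abstract describes; that training step is omitted from this sketch.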
