Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

Abstract

Contrastive learning methods for unsupervised visual representation learning have reached remarkable levels of transfer performance. We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions. In this paper, we introduce pixel-level pretext tasks for learning dense feature representations. The first task directly applies contrastive learning at the pixel level. We additionally propose a pixel-to-propagation consistency task that produces better results, even surpassing the state-of-the-art approaches by a large margin. Specifically, it achieves 60.2 AP, 41.4 / 40.5 mAP and 77.2 mIoU when transferred to Pascal VOC object detection (C4), COCO object detection (FPN / C4) and Cityscapes semantic segmentation using a ResNet-50 backbone network, which are 2.6 AP, 0.8 / 1.0 mAP and 1.0 mIoU better than the previous best methods built on instance-level contrastive learning. Moreover, the pixel-level pretext tasks are found to be effective for pre-training not only regular backbone networks but also head networks used for dense downstream tasks, and are complementary to instance-level contrastive methods. These results demonstrate the strong potential of defining pretext tasks at the pixel level, and suggest a new path forward in unsupervised visual representation learning.
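
To make the idea of applying contrastive learning at the pixel level concrete, the following is a minimal sketch of an InfoNCE-style loss computed over per-pixel feature vectors from two augmented views. It is an illustration under stated assumptions, not the exact formulation used in the paper: the function names, the `is_positive` mask (which pixel pairs across views count as positives), and the temperature value are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pixel_contrastive_loss(feats_a, feats_b, is_positive, temperature=0.3):
    """InfoNCE-style loss over pixel features from two views (illustrative).

    feats_a, feats_b: lists of per-pixel feature vectors, one list per view.
    is_positive[i][j]: True if pixel i of view A and pixel j of view B are
        treated as a positive pair. How positives are chosen is an
        assumption left to the caller.
    temperature: softmax temperature; 0.3 is an illustrative value only.
    """
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    losses = []
    for i, va in enumerate(a):
        if not any(is_positive[i]):
            continue  # a pixel with no positive partner contributes nothing
        # Cosine similarity to every pixel of the other view, temperature-scaled.
        sims = [sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
        peak = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in sims]
        pos = sum(e for j, e in enumerate(exps) if is_positive[i][j])
        losses.append(-math.log(pos / sum(exps)))
    # Average over pixels that had at least one positive.
    return sum(losses) / len(losses) if losses else 0.0


# Two pixels per view with orthogonal features.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.0], [0.0, 1.0]]
matched = [[True, False], [False, True]]      # positives share features
mismatched = [[False, True], [True, False]]   # positives have different features
loss_matched = pixel_contrastive_loss(view_a, view_b, matched)
loss_mismatched = pixel_contrastive_loss(view_a, view_b, mismatched)
```

In this toy setup the loss is small when positive pairs have similar features and large when they do not, which is the signal that pushes features of corresponding pixels together and those of non-corresponding pixels apart.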
