Siamese Masked Autoencoders

Abstract

Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction ($95\%$) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.
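
The asymmetric masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed seed, and the patch count of 196 (which would correspond to a 224x224 frame split into 16x16 patches) are assumptions made for illustration. Only the $95\%$ mask ratio on the future frame and the unmasked past frame come from the text.

```python
import random


def sample_visible_patches(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches left visible after random masking."""
    num_masked = int(num_patches * mask_ratio)
    num_visible = num_patches - num_masked
    return sorted(rng.sample(range(num_patches), num_visible))


def asymmetric_mask(num_patches=196, future_mask_ratio=0.95, seed=0):
    """Hypothetical helper: choose visible patches for a (past, future) frame pair.

    The past frame is left fully visible; the future frame keeps only a small
    random subset of patches, and the rest become reconstruction targets.
    """
    rng = random.Random(seed)
    past_visible = list(range(num_patches))  # past frame: nothing masked
    future_visible = sample_visible_patches(num_patches, future_mask_ratio, rng)
    visible = set(future_visible)
    future_masked = [i for i in range(num_patches) if i not in visible]
    return past_visible, future_visible, future_masked


past_visible, future_visible, future_masked = asymmetric_mask()
print(len(past_visible), len(future_visible), len(future_masked))  # 196 10 186
```

Under these assumptions the encoder would see all 196 patches of the past frame but only 10 patches of the future frame, so the remaining 186 future patches have to be predicted largely from the past frame.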
