Self-supervised Video Object Segmentation by Motion Grouping

Abstract

Animals have evolved highly functional visual systems to understand motion, assisting perception even under complex environments. In this paper, we work towards developing a computer vision system able to segment objects by exploiting motion cues, i.e. motion segmentation. We make the following contributions: First, we introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background. Second, we train the architecture in a self-supervised manner, i.e. without using any manual annotations. Third, we analyze several critical components of our method and conduct thorough ablation studies to validate their necessity. Fourth, we evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59). Despite using only optical flow as input, our approach achieves superior or comparable results to previous state-of-the-art self-supervised methods, while being an order of magnitude faster. We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models.
