Abstract
We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows learning dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train such a model results in a degenerate solution. We propose to prevent this with a simple regularisation scheme that accommodates the equivariance property of the segmentation task to similarity transformations. Our training objective admits an efficient implementation and exhibits fast training convergence. On established VOS benchmarks, our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.
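As an illustration of the uniform grid sampling mentioned above, the following minimal sketch picks one anchor feature per cell of a regular grid laid over a dense feature map. It is not the implementation used in this work: the function name `sample_grid_anchors`, the `(C, H, W)` feature layout, and the choice of the cell centre as the sampling location are all assumptions made for the example.

```python
import numpy as np

def sample_grid_anchors(features, grid_size):
    """Pick one feature vector per cell of a uniform grid_size x grid_size grid.

    features: array of shape (C, H, W), a dense feature map (assumed layout).
    Returns (anchors, coords): anchors has shape (grid_size**2, C) and
    coords has shape (grid_size**2, 2) holding the (row, col) of each anchor.
    """
    _, h, w = features.shape
    # Assumption: sample at the centre of each grid cell.
    rows = ((np.arange(grid_size) + 0.5) * h / grid_size).astype(int)
    cols = ((np.arange(grid_size) + 0.5) * w / grid_size).astype(int)
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    coords = np.stack([rr.ravel(), cc.ravel()], axis=1)
    # Gather the C-dimensional feature at every sampled location.
    anchors = features[:, coords[:, 0], coords[:, 1]].T
    return anchors, coords

# Toy usage: a 2-channel 8x8 feature map sampled on a 4x4 grid.
features = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
anchors, coords = sample_grid_anchors(features, grid_size=4)
```

In this sketch, applying the function to frames from the same video and from different videos would yield the anchor sets that the model is then trained to tell apart.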