Abstract
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module. This internal learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond standard few-shot learning techniques by learning what the few-shot learner should learn. This allows us to achieve a rich internal representation of the target in the current frame, significantly increasing the segmentation accuracy of our approach. We perform extensive experiments on multiple benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result.