Point-VOS: Pointing Up Video Object Segmentation

Abstract

Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing. This requires time-consuming and costly video annotation mechanisms. We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort. We apply our annotation scheme to two large-scale video datasets with text descriptions and annotate over 19M points across 133K objects in 32K videos. Based on our annotations, we propose a new Point-VOS benchmark, and a corresponding point-based training mechanism, which we use to establish strong baseline results. We show that existing VOS methods can easily be adapted to leverage our point annotations during training, and can achieve results close to the fully-supervised performance when trained on pseudo-masks generated from these points. In addition, we show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task. We will make our code and annotations available at https://pointvos.github.io.
