Language as Queries for Referring Video Object Segmentation

Abstract

Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred to by a language expression in all video frames. In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only. They are eventually transformed into dynamic kernels, which capture the crucial object-level information and play the role of convolution filters to generate the segmentation masks from feature maps. Object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline, and the end-to-end framework is significantly different from previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer. On Ref-Youtube-VOS, ReferFormer achieves 55.6 J&F with a ResNet-50 backbone without bells and whistles, which exceeds the previous state-of-the-art performance by 8.4 points. In addition, with the strong Swin-Large backbone, ReferFormer achieves the best J&F of 62.4 among all existing methods. The J&F metric can be further boosted to 63.3 by adopting a simple post-processing technique. Moreover, we show impressive results of 55.0 mAP and 43.7 mAP on A2D-Sentences and JHMDB-Sentences respectively, which outperform the previous methods by a large margin. Code is publicly available at https://github.com/wjn922/ReferFormer.
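
The mechanism described above (queries turned into dynamic kernels that act as convolution filters, with tracking obtained by linking the same query across frames) can be illustrated with a small pure-Python sketch. It is not the ReferFormer implementation: the function names `apply_dynamic_kernel` and `segment_video`, the single 1x1 kernel per query, the zero threshold on the logits, and the rule of picking the query with the highest frame-averaged score are all assumptions made for illustration.

```python
from typing import List, Tuple

Frame = List[List[List[float]]]  # one frame of features, shaped C x H x W


def apply_dynamic_kernel(feature: Frame, weights: List[float], bias: float) -> List[List[float]]:
    """Use one query's kernel as a 1x1 convolution filter over a feature map.

    The mask logit at (y, x) is sum_c weights[c] * feature[c][y][x] + bias.
    """
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [sum(w * channel[y][x] for w, channel in zip(weights, feature)) + bias
         for x in range(width)]
        for y in range(height)
    ]


def segment_video(features, kernels, biases, scores) -> Tuple[int, list]:
    """Pick one query for the whole clip and produce its mask in every frame.

    features[t] is a Frame; kernels[t][i] and biases[t][i] belong to query i
    in frame t; scores[t][i] is that query's confidence (hypothetical input).
    """
    num_queries = len(scores[0])
    # Assumption: query i denotes the same object in every frame, so its
    # scores can be averaged over time and the best index gives the track.
    mean_scores = [sum(frame[i] for frame in scores) / len(scores)
                   for i in range(num_queries)]
    best = max(range(num_queries), key=mean_scores.__getitem__)
    masks = [
        [[logit > 0.0 for logit in row]
         for row in apply_dynamic_kernel(f, k[best], b[best])]
        for f, k, b in zip(features, kernels, biases)
    ]
    return best, masks


# Toy clip: 2 frames, 2 channels, 1 x 2 spatial grid, 2 queries per frame.
features = [
    [[[1.0, -1.0]], [[0.0, 2.0]]],
    [[[-1.0, 1.0]], [[2.0, 0.0]]],
]
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
biases = [[0.0, -1.0], [0.0, -1.0]]
scores = [[0.2, 0.9], [0.4, 0.7]]

best, masks = segment_video(features, kernels, biases, scores)
```

In this toy clip the second query has the higher average score, so its kernel is applied to both frames and the two resulting masks form the track for the referred object. No separate association step is needed, which is the simplification the abstract points to.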
