Temporal Collection and Distribution for Referring Video Object Segmentation

Abstract

Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level, but segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing the video-level referent according to the language expression, while the latter serves to better locate and segment objects within each frame. Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interaction between the global referent token and the object queries. Specifically, the temporal collection mechanism collects global information for the referent token from object queries to the temporal motions to the language expression. In turn, the temporal distribution first distributes the referent token to the referent sequence across all frames and then performs efficient cross-frame reasoning between the referent sequence and object queries in every frame. Experimental results show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly.
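
The data flow of the collection-distribution mechanism can be illustrated with a minimal NumPy sketch. It is not the paper's architecture: the abstract gives no layer definitions, so the single-head dot-product attention, the residual updates, the per-frame pooling used to build the referent sequence, and all names (`temporal_collection`, `temporal_distribution`, `referent`, `queries`, `language`) are assumptions made for illustration only. Object queries are assumed to have shape (T, N, D) for T frames, N queries per frame, and D channels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_collection(referent, queries, language):
    """Collect video-level information into the referent token.

    Assumption: the referent token, conditioned on a pooled language
    feature, attends over the object queries of all frames at once.
    referent: (D,), queries: (T, N, D), language: (D,)
    """
    T, N, D = queries.shape
    flat = queries.reshape(T * N, D)
    scores = flat @ (referent + language) / np.sqrt(D)   # (T*N,)
    weights = softmax(scores)                            # sums to 1 over the video
    return referent + weights @ flat, weights.reshape(T, N)

def temporal_distribution(referent, queries):
    """Distribute the referent token to frames, then reason across frames.

    Assumption: a frame-specific referent sequence is built by pooling
    each frame's queries with the global token; every object query then
    attends over the referent entries of all frames.
    """
    T, N, D = queries.shape
    per_frame = softmax(queries @ referent / np.sqrt(D), axis=1)   # (T, N)
    pooled = np.einsum("tn,tnd->td", per_frame, queries)           # (T, D)
    referent_seq = referent[None, :] + pooled                      # (T, D)
    cross = softmax(queries @ referent_seq.T / np.sqrt(D), axis=-1)  # (T, N, T)
    return queries + cross @ referent_seq                          # (T, N, D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(4, 3, 8))   # 4 frames, 3 queries, 8 channels
    referent = rng.normal(size=8)
    language = rng.normal(size=8)
    referent, weights = temporal_collection(referent, queries, language)
    updated = temporal_distribution(referent, queries)
    print(referent.shape, weights.shape, updated.shape)
```

In this sketch the cross-frame step attends over T referent entries rather than over all T×N object queries, which is one plausible reading of why reasoning through a referent sequence could be efficient.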
