End-to-End Referring Video Object Segmentation with Multimodal Transformers

Abstract

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can both be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR
