Bootstrapping Referring Multi-Object Tracking

Abstract

Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.
