Abstract
Open-vocabulary multiple object tracking aims to generalize trackers to categories unseen during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference speeds and significantly reducing preprocessing requirements. Moreover, an experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at https://github.com/jinyanglii/OVTR.