Abstract
Object tracking (OT) aims to estimate the positions of target objects in a video sequence. Depending on whether the initial states of the target objects are specified by annotations provided in the first frame or by their categories, OT can be classified into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline. Extensive experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves results on par with or even better than those of both task-specific and unified tracking models.