TDT: Teaching Detectors to Track without Fully Annotated Videos

Abstract

Recently, one-stage trackers that use a joint model to predict both detections and appearance embeddings in one forward pass have received much attention and achieved state-of-the-art results on the Multi-Object Tracking (MOT) benchmarks. However, their success depends on the availability of videos that are fully annotated with tracking data, which are expensive and hard to obtain. This can limit model generalization. In comparison, the two-stage approach, which performs detection and embedding separately, is slower but easier to train because its data are easier to annotate. We propose to combine the best of the two worlds through a data distillation approach. Specifically, we use a teacher embedder, trained on Re-ID datasets, to generate pseudo appearance embedding labels for the detection datasets. Then, we use the augmented dataset to train a detector that is also capable of regressing these pseudo-embeddings in a fully-convolutional fashion. Our proposed one-stage solution matches its two-stage counterpart in quality but is 3 times faster. Even though the teacher embedder has not seen any tracking data during training, our proposed tracker achieves performance competitive with some popular trackers (e.g., JDE) trained with fully labeled tracking data.
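
The data distillation step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: `toy_teacher_embed` is a stand-in for a real Re-ID teacher network, `EMBED_DIM` is an arbitrary embedding size, and the mean-squared-error loss is just one plausible choice for "regressing" the pseudo-embeddings, since the abstract does not specify the loss.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical size; the abstract does not state the embedding dimension


def toy_teacher_embed(crop, dim=EMBED_DIM):
    """Stand-in for a Re-ID teacher: maps an HxWx3 crop to a unit-length vector.

    A real teacher would be a network trained on Re-ID datasets; here a fixed
    random projection of the mean color is used purely for illustration.
    """
    rng = np.random.default_rng(0)  # fixed seed so the projection is deterministic
    proj = rng.standard_normal((3, dim))
    feat = crop.reshape(-1, 3).mean(axis=0) @ proj
    return feat / (np.linalg.norm(feat) + 1e-12)


def add_pseudo_embeddings(image, boxes, teacher=toy_teacher_embed):
    """Augment detection boxes (x1, y1, x2, y2) with teacher-generated embeddings."""
    labels = []
    for (x1, y1, x2, y2) in boxes:
        crop = image[y1:y2, x1:x2]
        labels.append({"box": (x1, y1, x2, y2), "embedding": teacher(crop)})
    return labels


def embedding_regression_loss(pred, target):
    """Assumed MSE objective for a student head regressing the pseudo-embeddings."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))


# Usage: a random "image" with two annotated detection boxes.
image = np.random.default_rng(1).random((64, 64, 3))
boxes = [(0, 0, 16, 32), (20, 10, 40, 60)]
augmented = add_pseudo_embeddings(image, boxes)
```

In this sketch the detection dataset needs no identity or tracking annotations: the teacher supplies one embedding target per box, and the student detector would be trained to reproduce those targets alongside its usual detection outputs.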
