Lucid Data Dreaming for Multiple Object Tracking

Abstract

Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k~10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x~100x less annotated data than competing methods. Our approach is suitable for both single and multiple object tracking. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the object tracking task.
