TAO: A Large-Scale Benchmark for Tracking Any Object

Abstract

For many years, multi-object tracking benchmarks have focused on a handful of categories. Motivated primarily by surveillance and self-driving applications, these datasets provide tracks for people, vehicles, and animals, ignoring the vast majority of objects in the world. By contrast, in the related field of object detection, the introduction of large-scale, diverse datasets (e.g., COCO) has fostered significant progress in developing highly robust solutions. To bridge this gap, we introduce a similarly diverse dataset for Tracking Any Object (TAO). It consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average. Importantly, we adopt a bottom-up approach for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks. To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum. Our vocabulary is both significantly larger than and qualitatively different from that of existing tracking datasets. To ensure scalability of annotation, we employ a federated approach that focuses manual effort on labeling tracks for those relevant objects in a video (e.g., those that move). We perform an extensive evaluation of state-of-the-art trackers and make a number of important discoveries regarding large-vocabulary tracking in an open world. In particular, we show that existing single- and multi-object trackers struggle when applied to this scenario in the wild, and that detection-based, multi-object trackers are in fact competitive with user-initialized ones. We hope that our dataset and analysis will boost further progress in the tracking community.
