STEP: Segmenting and Tracking Every Pixel

Abstract

In this paper, we tackle video panoptic segmentation, a task that requires assigning semantic classes and track identities to all pixels in a video. To study this important problem in a setting that requires a continuous interpretation of sensory data, we present a new benchmark: Segmenting and Tracking Every Pixel (STEP), encompassing two datasets, KITTI-STEP and MOTChallenge-STEP, together with a new evaluation metric. Our work is the first that targets this task in a real-world setting that requires dense interpretation in both spatial and temporal domains. As the ground truth for this task is difficult and expensive to obtain, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. By contrast, our datasets contain long video sequences, providing challenging examples and a test-bed for studying long-term pixel-precise segmentation and tracking. For measuring the performance, we propose a novel evaluation metric, Segmentation and Tracking Quality (STQ), that fairly balances semantic and tracking aspects of this task and is suitable for evaluating sequences of arbitrary length. We will make our datasets, metric, and baselines publicly available.
