Unsupervised Adversarial Visual Level Domain Adaptation for Learning Video Object Detectors from Images

Abstract

Deep learning based object detectors require thousands of diversified bounding box and class annotated examples. Though image object detectors have shown rapid progress in recent years with the release of multiple large-scale static image datasets, object detection on videos still remains an open problem due to the scarcity of annotated video frames. A robust video object detector is an essential component for video understanding and for curating large-scale automated annotations in videos. The domain difference between images and videos makes the transferability of image object detectors to videos sub-optimal. The most common solution is to use weakly supervised annotations, where each video frame has to be tagged for the presence or absence of object categories. This still requires manual effort. In this paper we take a step forward by adapting the concept of unsupervised adversarial image-to-image translation to perturb static high-quality images so that they are visually indistinguishable from a set of video frames. We assume the presence of a fully annotated static image dataset and an unannotated video dataset. The object detector is trained on the adversarially transformed image dataset using the annotations of the original dataset. Experiments on the Youtube-Objects and Youtube-Objects-Subset datasets with two contemporary baseline object detectors reveal that such unsupervised pixel level domain adaptation boosts generalization performance on video frames compared to direct application of the original image object detector. We also achieve competitive performance compared to recent weakly supervised baselines. This paper can be seen as an application of image translation to cross-domain object detection.
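The data flow described in the abstract (translate each annotated image toward the video domain, then train the detector on the translated image with the original annotations) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `build_adapted_dataset`, `toy_translate`, and the box layout are illustrative assumptions, and `toy_translate` is only a stand-in for a learned adversarial generator. The sketch also assumes that translation changes pixel appearance but not geometry, which is what allows the original bounding boxes to be reused.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical annotation layout: (x_min, y_min, x_max, y_max, class_label).
Box = Tuple[float, float, float, float, str]
# Toy grayscale image as nested lists of pixel intensities.
Image = List[List[float]]


def build_adapted_dataset(
    images: Sequence[Image],
    annotations: Sequence[List[Box]],
    translate: Callable[[Image], Image],
) -> List[Tuple[Image, List[Box]]]:
    """Pair each translated image with the annotations of its original image."""
    if len(images) != len(annotations):
        raise ValueError("each image needs exactly one annotation list")
    # Only the pixels change; the boxes are copied over untouched.
    return [(translate(img), list(boxes)) for img, boxes in zip(images, annotations)]


def toy_translate(img: Image) -> Image:
    """Stand-in for a learned generator: dims pixels and keeps the image size."""
    return [[0.5 * px for px in row] for row in img]


images = [[[1.0, 0.0], [0.5, 1.0]]]
annotations = [[(0.0, 0.0, 1.0, 1.0, "cat")]]
adapted = build_adapted_dataset(images, annotations, toy_translate)
```

In a real pipeline the resulting pairs would be fed to a detector's training loop in place of the original image dataset; no video annotations are involved at any point.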
