Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos

Abstract

Learning to predict scene depth from RGB inputs is a challenging task both for indoor and outdoor robot navigation. In this work we address unsupervised learning of scene depth and robot ego-motion where supervision is provided by monocular videos, as cameras are the cheapest, least restrictive and most ubiquitous sensor for robotics. Previous work in unsupervised image-to-depth learning has established strong baselines in the domain. We propose a novel approach which produces higher quality results, is able to model moving objects and is shown to transfer across data domains, e.g. from outdoors to indoor scenes. The main idea is to introduce geometric structure in the learning process, by modeling the scene and the individual objects; camera ego-motion and object motions are learned from monocular videos as input. Furthermore an online refinement method is introduced to adapt learning on the fly to unknown domains. The proposed approach outperforms all state-of-the-art approaches, including those that handle motion e.g. through learned flow. Our results are comparable in quality to the ones which used stereo as supervision and significantly improve depth prediction on scenes and datasets which contain a lot of object motion. The approach is of practical relevance, as it allows transfer across environments, by transferring models trained on data collected for robot navigation in urban scenes to indoor navigation settings. The code associated with this paper can be found at https://sites.google.com/view/struct2depth.
