The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth

Abstract

Self-supervised monocular depth estimation networks are trained to predict scene depth using nearby frames as a supervision signal during training. However, for many applications, sequence information in the form of video frames is also available at test time. The vast majority of monocular networks do not make use of this extra signal, thus ignoring valuable information that could be used to improve the predicted depth. Those that do, either use computationally expensive test-time refinement techniques or off-the-shelf recurrent networks, which only indirectly make use of the geometric information that is inherently available. We propose ManyDepth, an adaptive approach to dense depth estimation that can make use of sequence information at test time, when it is available. Taking inspiration from multi-view stereo, we propose a deep end-to-end cost volume based approach that is trained using self-supervision only. We present a novel consistency loss that encourages the network to ignore the cost volume when it is deemed unreliable, e.g. in the case of moving objects, and an augmentation scheme to cope with static cameras. Our detailed experiments on both KITTI and Cityscapes show that we outperform all published self-supervised baselines, including those that use single or multiple frames at test time.
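
To make the cost volume idea mentioned in the abstract concrete, the following is a minimal toy sketch, not the ManyDepth implementation. It assumes that each hypothesis is an integer horizontal pixel shift, whereas a multi-view stereo system would warp learned features using camera pose, intrinsics, and a set of candidate depths. The names `build_cost_volume` and `best_hypothesis` are hypothetical and chosen only for illustration.

```python
import numpy as np

def build_cost_volume(ref, src, shifts):
    """Stack per-hypothesis matching costs into a (D, H, W) volume.

    Assumption: each hypothesis is an integer horizontal shift, a toy
    stand-in for warping the source frame with pose and candidate depth.
    """
    costs = []
    for s in shifts:
        # np.roll wraps around at the image border; a real system would
        # mask out-of-view pixels instead of wrapping them.
        warped = np.roll(src, shift=s, axis=1)
        # Absolute difference between the reference and the warped source.
        costs.append(np.abs(ref - warped))
    return np.stack(costs, axis=0)

def best_hypothesis(cost_volume, shifts):
    """Pick, per pixel, the hypothesis with the lowest matching cost."""
    return np.asarray(shifts)[np.argmin(cost_volume, axis=0)]

# Synthetic example: the source frame is the reference shifted left by 3 px.
rng = np.random.default_rng(0)
ref = rng.random((8, 32))
src = np.roll(ref, shift=-3, axis=1)

shifts = [0, 1, 2, 3, 4, 5]
volume = build_cost_volume(ref, src, shifts)
estimate = best_hypothesis(volume, shifts)
```

In this sketch the per-pixel minimum of the volume recovers the shift that aligns the two frames. For a moving object or a static camera the lowest-cost hypothesis would no longer correspond to true depth, which is the failure case the consistency loss and the augmentation scheme described in the abstract are meant to address.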
