Abstract
In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by training it to predict the future. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to iteratively predict multi-frame features one time step ahead. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries that work with the depth decoder, as well as with a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latency similar to that of monocular models.
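
To make the iterative one-step-ahead objective concrete, the following is a minimal sketch and not the actual F-Net implementation. It assumes per-frame features are flattened to vectors, uses a mean squared error as the training signal, and treats the predictor as an arbitrary callable; the names `rollout_loss`, `predict_next`, `window`, and `steps` are hypothetical and do not appear in the paper.

```python
import numpy as np


def rollout_loss(features, predict_next, window, steps):
    """Mean squared error of an iterative one-step-ahead feature rollout.

    features:     array of shape (T, D), one feature vector per frame
                  (a toy stand-in for multi-frame feature volumes).
    predict_next: callable mapping a (window, D) array to a (window, D)
                  array shifted one time step ahead (hypothetical
                  stand-in for F-Net).
    window:       number of consecutive frames fed to the predictor.
    steps:        number of rollout iterations.
    """
    num_frames = features.shape[0]
    if window + steps > num_frames:
        raise ValueError("need at least window + steps frames")

    current = features[:window]
    total = 0.0
    for k in range(1, steps + 1):
        # Feed the previous prediction back in to move one more step ahead.
        current = predict_next(current)
        # Ground-truth features for frames k .. k + window - 1.
        target = features[k:k + window]
        total += float(np.mean((current - target) ** 2))
    return total / steps


# Toy sequence: frame t has every feature equal to t (constant "motion").
toy_features = np.arange(6, dtype=float)[:, None] * np.ones((1, 4))

# A predictor that has learned the motion adds one per step.
perfect_loss = rollout_loss(toy_features, lambda w: w + 1.0, window=3, steps=3)

# A predictor that ignores motion simply copies its input.
static_loss = rollout_loss(toy_features, lambda w: w, window=3, steps=3)
```

In this toy setting the motion-aware predictor incurs zero loss, while the static predictor's error grows with each rollout step, which illustrates why such an objective pushes a predictor to encode motion and correspondence.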