NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation

Abstract

Video depth estimation aims to infer temporally consistent depth. One approach is to finetune a single-image model on each video with geometry constraints, which proves inefficient and lacks robustness. An alternative is learning to enforce consistency from data, which requires well-designed models and sufficient video depth data. To address both challenges, we introduce NVDS+, which stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner. We also present a large-scale Video Depth in the Wild (VDW) dataset, which contains 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Additionally, a bidirectional inference strategy is designed to improve consistency by adaptively fusing forward and backward predictions. We instantiate a model family ranging from small to large scales for different applications. The method is evaluated on the VDW dataset and three public benchmarks. To further demonstrate its versatility, we extend NVDS+ to video semantic segmentation and several downstream applications such as bokeh rendering, novel view synthesis, and 3D reconstruction. Experimental results show that our method achieves significant improvements in consistency, accuracy, and efficiency. Our work serves as a solid baseline and data foundation for learning-based video depth estimation. Code and dataset are available at: https://github.com/RaymondWang987/NVDS
