Abstract
Despite advances in self-supervised monocular depth estimation, challenges persist in dynamic scenarios because existing methods depend on the assumption of a static world. In this paper, we present Manydepth2, a Motion-Guided Cost Volume Depth Net, which achieves precise depth estimation for both dynamic objects and static backgrounds while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we combine optical flow and coarse monocular depth to create a novel static reference frame. This frame is then used together with the target frame to build a motion-guided cost volume. Additionally, to improve the accuracy and robustness of the network, we introduce an attention-based depth net architecture that effectively integrates information from feature maps of varying resolutions. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset. The code is available at https://github.com/kaichen-z/Manydepth2
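
To make the notion of a cost volume concrete, the following minimal sketch scores a set of depth hypotheses per pixel by comparing target-frame features with reference-frame features that have already been warped under each hypothesis. It is an illustration under stated assumptions, not the Manydepth2 implementation: the mean absolute difference as matching cost, the function names build_cost_volume and best_hypothesis, and the toy feature values are all hypothetical, and the warping step (which in our method would use the static reference frame) is assumed to have been done beforehand.

```python
from typing import List

Feature = List[float]  # one feature vector (C channels) for a single pixel


def build_cost_volume(target: List[Feature],
                      warped: List[List[Feature]]) -> List[List[float]]:
    """Return cost[d][p]: mean absolute feature difference at pixel p between
    the target frame and the reference frame warped with depth hypothesis d.
    ASSUMPTION: the matching cost is illustrative, not taken from the paper."""
    volume = []
    for warped_d in warped:                        # one slice per depth hypothesis
        costs = []
        for f_t, f_r in zip(target, warped_d):     # one entry per pixel
            diff = sum(abs(a - b) for a, b in zip(f_t, f_r))
            costs.append(diff / len(f_t))
        volume.append(costs)
    return volume


def best_hypothesis(volume: List[List[float]]) -> List[int]:
    """Per pixel, the index of the depth hypothesis with the lowest cost."""
    num_pixels = len(volume[0])
    return [min(range(len(volume)), key=lambda d: volume[d][p])
            for p in range(num_pixels)]


# Toy data (made up): 3 pixels, 2 channels, 3 depth hypotheses.
target = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
warped = [
    [[1.0, 2.0], [2.0, 2.0], [0.0, 0.0]],   # hypothesis 0
    [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]],   # hypothesis 1
    [[3.0, 3.0], [1.0, 1.0], [3.0, 1.0]],   # hypothesis 2
]
volume = build_cost_volume(target, warped)
best = best_hypothesis(volume)
```

In this toy setting each pixel matches exactly one hypothesis, so the per-pixel minimum is unambiguous. A moving object violates the static-world assumption behind the warp, so its lowest-cost hypothesis can point to the wrong depth; this is the failure mode that building the volume from a static reference frame is meant to avoid.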