Abstract
In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on a multi-view stereo (MVS) setting which encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other MVS methods, MonoRec is able to predict accurate depths for both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we further demonstrate that MonoRec is able to generalize well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded by a handheld camera. Training code and the pre-trained model will be published soon.