Abstract
The self-supervised learning of depth and pose from monocular sequencesprovides an attractive solution by using the photometric consistency of nearbyframes as it depends much less on the ground-truth data. In this paper, weaddress the issue when previous assumptions of the self-supervised approachesare violated due to the dynamic nature of real-world scenes. Different fromhandling the noise as uncertainty, our key idea is to incorporate more robustgeometric quantities and enforce internal consistency in the temporal imagesequence. As demonstrated on commonly used benchmark datasets, the proposedmethod substantially improves the state-of-the-art methods on both depth andrelative pose estimation for monocular image sequences, without addinginference overhead.