Abstract
Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel loss that incorporates temporally distant (e.g., O(100)) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a "global" loss given the first-stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.