MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

Abstract

We propose MAMo, a novel memory and attention framework for monocular video depth estimation. MAMo can augment any single-image depth estimation network and turn it into a video depth estimation model, enabling it to take advantage of temporal information to predict more accurate depth. In MAMo, we augment the model with memory, which aids depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens from previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both the past and the present visual information. We adopt an attention-based approach to process the memory features: we first learn the spatio-temporal relation among the resultant visual and displacement memory tokens using a self-attention module. The output features of self-attention are then aggregated with the current visual features through cross-attention. Finally, the cross-attended features are given to a decoder to predict depth on the current frame. Through extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, we show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy. Notably, MAMo video depth estimation provides higher accuracy with lower latency compared to SOTA cost-volume-based video depth models.
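
The attention pipeline described above (self-attention over memory tokens, then cross-attention from the current frame's features to that memory) can be illustrated with a minimal sketch. This is not the authors' implementation: it omits learned projections, multiple heads, and the decoder, and the function names `attend`, `fuse_with_memory`, and `update_memory` are hypothetical. In particular, `update_memory` uses a plain first-in-first-out rule as a stand-in for the learned memory update the abstract refers to.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    No learned query/key/value projections are applied (a simplification).
    """
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse_with_memory(current_tokens, memory_tokens):
    """Self-attend over memory, then cross-attend from the current frame."""
    # Step 1: relate memory tokens (visual + displacement) to one another.
    refined_memory = attend(memory_tokens, memory_tokens, memory_tokens)
    # Step 2: current-frame tokens query the refined memory.
    return attend(current_tokens, refined_memory, refined_memory)


def update_memory(memory_tokens, new_tokens, capacity):
    """ASSUMPTION: FIFO stand-in, not the learned update scheme of the paper."""
    return (memory_tokens + new_tokens)[-capacity:]


# Toy usage: two memory tokens, one current-frame token, feature size 2.
memory = [[1.0, 0.0], [0.0, 1.0]]
current = [[1.0, 0.0]]
fused = fuse_with_memory(current, memory)   # would be passed to a depth decoder
memory = update_memory(memory, current, capacity=2)
```

In a real model each attention step would use learned projections and the fused features would feed a depth decoder; the sketch only shows the order of operations and the roles of queries, keys, and values.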
