Abstract
Self-supervised monocular depth estimation, which does not require ground truth for training, has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model size. In this paper, we achieve comparable results with a lightweight architecture. Specifically, we investigate the efficient combination of CNNs and Transformers, and design a hybrid architecture, Lite-Mono. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that our full model outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.