Abstract
Estimating depth from a single 2D image is challenging because a single view lacks the stereo or multi-view data that normally provide depth information. This paper addresses the challenge with a novel deep learning-based approach that uses an encoder-decoder architecture with the Inception-ResNet-v2 model as the encoder. To the best of our knowledge, this is the first use of Inception-ResNet-v2 as an encoder for monocular depth estimation, and it performs better than previous models. Inception-ResNet-v2 enables our model to capture complex objects and fine-grained details that are generally difficult to predict. In addition, our model incorporates multi-scale feature extraction to improve depth prediction accuracy for objects of different sizes and at different distances. We propose a composite loss function consisting of a depth loss, a gradient edge loss, and an SSIM loss, combined as a weighted sum whose weights are tuned to balance the different aspects of depth estimation. Experimental results on the NYU Depth V2 dataset show that our model achieves state-of-the-art performance, with an absolute relative error (ARE) of 0.064, an RMSE of 0.228, and an accuracy ($\delta < 1.25$) of 89.3%. These metrics demonstrate that our model predicts depth effectively even in challenging circumstances, providing a scalable solution for real-world applications in robotics, 3D reconstruction, and augmented reality.
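
For readers unfamiliar with the three reported metrics, the following minimal sketch shows how they are commonly computed in the monocular depth estimation literature. It assumes the standard definitions (mean of |pred - gt| / gt for ARE, root mean squared error for RMSE, and the fraction of pixels with max(pred/gt, gt/pred) below the threshold for the $\delta$ accuracy). The function name `depth_metrics`, the flat-sequence inputs, and the masking of non-positive depths are illustrative assumptions, not details taken from this work.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Return (ARE, RMSE, delta accuracy) for flat sequences of depths.

    Assumption: pixels with non-positive ground truth or prediction are
    treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs to evaluate")
    n = len(pairs)

    # Absolute relative error: mean of |pred - gt| / gt.
    are = sum(abs(p - g) / g for p, g in pairs) / n

    # Root mean squared error between predicted and true depth.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)

    # Threshold accuracy: share of pixels whose ratio max(p/g, g/p)
    # is strictly below the threshold (1.25 by default).
    delta = sum(max(p / g, g / p) < threshold for p, g in pairs) / n

    return are, rmse, delta


# Toy example with three pixels (hypothetical values, not from the dataset).
are, rmse, delta = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

Lower ARE and RMSE indicate smaller errors, while a higher $\delta$ accuracy indicates that more pixels fall within the tolerated ratio of the ground truth.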