DiFuse-Net: RGB and Dual-Pixel Depth Estimation using Window Bi-directional Parallax Attention and Cross-modal Transfer Learning

  • 2025-06-17 17:49:27
  • Kunal Swami, Debtanu Gupta, Amrit Kumar Muduli, Chirag Jaiswal, Pankaj Kumar Bajpai

Abstract

Depth estimation is crucial for intelligent systems, enabling applications from autonomous navigation to augmented reality. While traditional stereo and active depth sensors have limitations in cost, power, and robustness, dual-pixel (DP) technology, ubiquitous in modern cameras, offers a compelling alternative. This paper introduces DiFuse-Net, a novel modality-decoupled network design for disentangled RGB- and DP-based depth estimation. DiFuse-Net features a window bi-directional parallax attention mechanism (WBiPAM) specifically designed to capture the subtle DP disparity cues unique to smartphone cameras with small apertures. A separate encoder extracts contextual information from the RGB image, and these features are fused to enhance depth prediction. We also propose a Cross-modal Transfer Learning (CmTL) mechanism that leverages large-scale RGB-D datasets from the literature to cope with the difficulty of obtaining a large-scale RGB-DP-D dataset. Our evaluation and comparison demonstrate the superiority of the proposed method over DP and stereo-based baseline methods. Additionally, we contribute a new, high-quality, real-world RGB-DP-D training dataset, named the Dual-Camera Dual-Pixel (DCDP) dataset, created using our novel symmetric stereo camera hardware setup, stereo calibration and rectification protocol, and AI stereo disparity estimation method.

 
