Abstract
The precise reconstruction of 3D objects from a single RGB image in complex scenes presents a critical challenge in virtual reality, autonomous driving, and robotics. Existing neural implicit 3D representation methods face significant difficulties in balancing the extraction of global and local features, particularly in diverse and complex environments, leading to insufficient reconstruction precision and quality. We propose M3D, a novel single-view 3D reconstruction framework, to tackle these challenges. This framework adopts a dual-stream feature extraction strategy based on Selective State Spaces to effectively balance the extraction of global and local features, thereby improving scene comprehension and representation precision. Additionally, a parallel branch extracts depth information, effectively integrating visual and geometric features to enhance reconstruction quality and preserve intricate details. Experimental results indicate that the fusion of multi-scale features with depth information via the dual-branch feature extraction significantly boosts geometric consistency and fidelity, achieving state-of-the-art reconstruction performance.