MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

Abstract

Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods degrade substantially when the viewpoint deviates significantly from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, a framework that integrates a Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, and it provides comprehensive supervision signals for refining 3DGS representations, making rendering more robust under extreme viewpoint changes. Experiments on the Waymo Open Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
