Abstract
Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to jointly edit images and LiDAR point clouds in driving scenarios. At the core of our approach is the introduction of 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism (comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement) to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.