Abstract
Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity-coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/