Abstract
Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a generative image modeling framework that bridges this gap by using a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder such as DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, improving both generative quality and training efficiency while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and enables a new inference strategy, Representation Guidance, which leverages the learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling.