Abstract
In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that training both the VAE and the diffusion model end-to-end using the standard diffusion loss is ineffective, and even degrades final performance. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, which allows both the VAE and the diffusion model to be jointly tuned during training. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance, speeding up diffusion model training by over 17x and 45x over the REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art, achieving an FID of 1.26 with classifier-free guidance and 1.83 without it on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.