Abstract
We argue that diffusion models' success in modeling complex distributions comes, for the most part, from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional, so as to allow the generation of samples outside the training distribution. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate, and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator training distribution.