Abstract
The recent success of transformer-based image generative models in object-centric learning highlights the importance of powerful image generators for handling complex scenes. However, despite the high expressiveness of diffusion models in image generation, their integration into object-centric learning remains largely unexplored. In this paper, we explore the feasibility and potential of integrating diffusion models into object-centric learning and investigate the pros and cons of this approach. We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes: it is the first object-centric learning model to replace conventional slot decoders with a latent diffusion model conditioned on object slots, and it is also the first unsupervised compositional conditional diffusion model that operates without the need for supervised annotations like text. Through experiments on various object-centric tasks, including the first application of the FFHQ dataset in this field, we demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders, particularly in more complex scenes, and exhibits superior unsupervised compositional generation quality. The project page is available at https://latentslotdiffusion.github.io