Abstract
Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains elusive. In this work, we propose a concept-driven TTI personalization framework that addresses these core challenges. We build on existing works that learn custom tokens for user-illustrated concepts, allowing those tokens to interact with existing text tokens in the TTI model. Importantly, however, to disentangle and better learn the concepts in question, we jointly learn (latent) segmentation masks that separate these concepts in user-provided image illustrations. We do so by introducing an Expectation Maximization (EM)-like optimization procedure in which we alternate between learning the custom tokens and estimating masks encompassing the corresponding concepts in user-supplied images. We obtain these masks from cross-attention within the U-Net parameterized latent diffusion model, followed by Dense CRF optimization. We show that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks. We illustrate the benefits of the proposed approach qualitatively and quantitatively (through user studies) with a number of examples and use cases that can combine up to three entangled concepts.
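To make the alternating structure concrete, the following is a minimal toy sketch of an EM-like loop, not the implementation used in this work. It rests on several simplifying assumptions: per-pixel feature vectors stand in for U-Net latents, a softmax over pixel-token dot products stands in for cross-attention, a hard argmax stands in for Dense CRF refinement, and a mean-feature update stands in for learning tokens under a diffusion loss. All function names (`attention_maps`, `estimate_masks`, `update_tokens`, `alternate`) are hypothetical.

```python
import numpy as np

def attention_maps(features, tokens):
    """Softmax over tokens of pixel-token similarity (toy stand-in for cross-attention)."""
    logits = features @ tokens.T                          # (H, W, K)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def estimate_masks(attn):
    """Assign each pixel to its most attended token (assumed stand-in for Dense CRF)."""
    return attn.argmax(axis=-1)                           # (H, W) integer labels

def update_tokens(features, masks, tokens):
    """Move each token to the mean feature inside its mask (assumed stand-in for token learning)."""
    new_tokens = tokens.copy()
    for k in range(tokens.shape[0]):
        selected = features[masks == k]                   # (N_k, D) pixels assigned to token k
        if len(selected) > 0:                             # keep the old token if its mask is empty
            new_tokens[k] = selected.mean(axis=0)
    return new_tokens

def alternate(features, tokens, num_rounds=5):
    """Alternate between mask estimation and token updates for a fixed number of rounds."""
    for _ in range(num_rounds):
        masks = estimate_masks(attention_maps(features, tokens))
        tokens = update_tokens(features, masks, tokens)
    masks = estimate_masks(attention_maps(features, tokens))
    return tokens, masks

# Toy "image": two concepts occupying the left and right halves of an 8x8 feature map.
rng = np.random.default_rng(0)
features = np.zeros((8, 8, 2))
features[:, :4] = [3.0, 0.0]
features[:, 4:] = [0.0, 3.0]
features += 0.1 * rng.standard_normal(features.shape)

initial_tokens = np.array([[1.0, 0.5], [0.5, 1.0]])
tokens, masks = alternate(features, initial_tokens)
```

In this toy setting each round sharpens both quantities: better tokens yield cleaner masks, and cleaner masks restrict each token update to the pixels of its own concept, which mirrors the joint refinement described above only in spirit.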