Break-A-Scene: Extracting Multiple Concepts from a Single Image

Abstract

Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed at improving the ability to combine multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: https://omriavrahami.com/break-a-scene/
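
As a rough illustration of how a masked diffusion loss and union-sampling could fit together, the following Python sketch restricts a squared-error loss to the union of the masks of a randomly chosen subset of concepts. It is not the authors' implementation: the function names, the nested-list image representation, and the uniform choice of subset size are assumptions made for illustration, and the cross-attention loss and two-phase optimization are omitted.

```python
import random


def union_mask(masks):
    """Pixel-wise union (logical OR) of equally sized binary masks."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[int(any(m[r][c] for m in masks)) for c in range(cols)]
            for r in range(rows)]


def masked_mse(pred, target, mask):
    """Mean squared error averaged only over pixels where mask == 1."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


def sample_concepts(num_concepts, rng):
    """Pick a random non-empty subset of concept indices.

    Assumption: the subset size is drawn uniformly from 1..num_concepts.
    """
    k = rng.randint(1, num_concepts)
    return sorted(rng.sample(range(num_concepts), k))


# Toy 2x2 example with two concepts, each covering one pixel.
concept_masks = [
    [[1, 0], [0, 0]],  # mask of the first concept
    [[0, 0], [0, 1]],  # mask of the second concept
]
pred_noise = [[1.0, 5.0], [5.0, 3.0]]    # stand-in for the model's prediction
true_noise = [[0.0, 0.0], [0.0, 1.0]]    # stand-in for the sampled noise

# Loss over the union of both concept masks; background pixels are ignored.
both = union_mask(concept_masks)
loss_both = masked_mse(pred_noise, true_noise, both)  # 2.5 for these toy values

# One hypothetical training step: sample a subset, then mask the loss with it.
rng = random.Random(0)
chosen = sample_concepts(len(concept_masks), rng)
step_mask = union_mask([concept_masks[i] for i in chosen])
step_loss = masked_mse(pred_noise, true_noise, step_mask)
```

In this toy case the large errors on the two background pixels do not contribute to the loss, which is the intended effect of masking: only the regions assigned to the sampled concepts drive the update.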
