Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition

Abstract

Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts (e.g., a user's own pets or specific items) with just a few examples for training. This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models. First, current personalization techniques fail to reliably extend to multiple concepts; we hypothesize this to be due to the mismatch between complex scenes and simple text descriptions in the pre-training dataset (e.g., LAION). Second, given an image containing multiple personalized concepts, there is no holistic metric that evaluates not just the degree of resemblance of the personalized concepts, but also whether all concepts are present in the image and whether the image accurately reflects the overall text description. To address these issues, we introduce Gen4Gen, a semi-automated dataset creation pipeline that uses generative models to combine personalized concepts into complex compositions along with text descriptions. Using this pipeline, we create a dataset called MyCanvas that can be used to benchmark the task of multi-concept personalization. In addition, we design a comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) for better quantifying the performance of multi-concept, personalized text-to-image diffusion methods. We provide a simple baseline built on top of Custom Diffusion with empirical prompting strategies for future researchers to evaluate on MyCanvas. We show that by improving data quality and prompting strategies, we can significantly increase multi-concept personalized image generation quality, without requiring any modifications to model architecture or training algorithms.
