ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Abstract

Personalized text-to-image generation using diffusion models has recently been proposed and has attracted considerable attention. Given a handful of images containing a novel concept (e.g., a unique toy), we aim to tune the generative model to capture fine visual details of the novel concept and generate photorealistic images following a text condition. We present a plug-in method, named ViCo, for fast and lightweight personalized generation. Specifically, we propose an image attention module to condition the diffusion process on patch-wise visual semantics. We introduce an attention-based object mask that comes at almost no cost from the attention module. In addition, we design a simple regularization based on the intrinsic properties of text-image attention maps to alleviate the common overfitting degradation. Unlike many existing models, our method does not finetune any parameters of the original diffusion model. This allows more flexible and transferable model deployment. With only light parameter training (~6% of the diffusion U-Net), our method achieves comparable or even better performance than all state-of-the-art models, both qualitatively and quantitatively.
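
As a rough illustration of the idea of conditioning on patch-wise visual semantics, the following minimal sketch computes cross-attention from the patches being generated to the patches of a reference image, and then derives a binary object mask from the same attention weights. It is not the ViCo implementation: the function names (`image_cross_attention`, `attention_mask`), the absence of learned projections, the toy 2-dimensional features, and the mean-threshold rule for the mask are all assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_cross_attention(queries, ref_patches):
    """Scaled dot-product attention from generated patches to reference patches.

    Assumption: the reference patch features serve as both keys and values,
    with no learned projections (a real module would learn these).
    """
    dim = len(ref_patches[0])
    scale = 1.0 / math.sqrt(dim)
    weights = [softmax([dot(q, k) * scale for k in ref_patches]) for q in queries]
    outputs = [
        [sum(w * v[d] for w, v in zip(row, ref_patches)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def attention_mask(weights):
    """Binary mask over reference patches, reusing the attention weights.

    Assumption: a patch is kept if the total attention it receives is above
    the mean over all reference patches (a hypothetical thresholding rule).
    """
    n_ref = len(weights[0])
    received = [sum(row[j] for row in weights) for j in range(n_ref)]
    threshold = sum(received) / n_ref
    return [1 if r > threshold else 0 for r in received]


# Toy data: two generated patches, three reference patches, 2-d features.
queries = [[1.0, 0.0], [0.9, 0.1]]
ref_patches = [[4.0, 0.0], [0.0, 4.0], [3.5, 0.5]]

outputs, weights = image_cross_attention(queries, ref_patches)
mask = attention_mask(weights)
print(mask)
```

Because the mask is computed from weights that the attention step already produces, it adds essentially no extra computation in this sketch, which mirrors the "almost no cost" property described above.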
