Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

Abstract

Diffusion models have shown superior performance in image generation and manipulation, but the inherent stochasticity presents challenges in preserving and manipulating image content and identity. While previous approaches like DreamBooth and Textual Inversion have proposed model or latent representation personalization to maintain the content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding by decomposing the CLIP embedding space for personalization and content manipulation. Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and complex semantic image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.
