An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Abstract

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion.github.io
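The core idea, optimizing one new embedding vector while every model weight stays frozen, can be illustrated with a toy sketch. This is not the paper's implementation: a small fixed linear map stands in for the frozen text-to-image model, a squared-error loss stands in for the actual training objective, and the names `frozen_model`, `learn_new_word`, `FROZEN_WEIGHTS`, and `EXAMPLES` are hypothetical, chosen only for illustration.

```python
import random

# Assumption: a tiny fixed linear map stands in for the frozen model.
# It is stored as a tuple of tuples so it cannot be modified in training.
FROZEN_WEIGHTS = ((1.0, 0.0), (0.0, 2.0), (1.0, 1.0))

# Assumption: three toy "images" of one concept, given as feature vectors.
EXAMPLES = [
    [0.6, -2.0, -0.5],
    [0.4, -2.0, -0.5],
    [0.5, -2.0, -0.5],
]


def frozen_model(vec, weights):
    """Apply the frozen linear map to an embedding vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def learn_new_word(targets, weights, dim, steps=500, lr=0.05, seed=0):
    """Fit a single embedding by gradient descent; weights are never updated."""
    rng = random.Random(seed)
    vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(steps):
        out = frozen_model(vec, weights)
        grad = [0.0] * dim
        for target in targets:
            err = [o - t for o, t in zip(out, target)]
            # Gradient of 0.5 * ||W v - t||^2 with respect to v is W^T err.
            for j, row in enumerate(weights):
                for i in range(dim):
                    grad[i] += row[i] * err[j] / len(targets)
        # Only the new embedding moves; the model itself is untouched.
        vec = [v - lr * g for v, g in zip(vec, grad)]
    return vec


learned = learn_new_word(EXAMPLES, FROZEN_WEIGHTS, dim=2)
print(learned)
```

In this toy setting the only trainable parameters are the entries of `vec`, mirroring the abstract's claim that a single word embedding can carry the concept while the model remains frozen.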
