$P+$: Extended Textual Conditioning in Text-to-Image Generation

Abstract

We introduce an Extended Textual Conditioning space in text-to-image models, referred to as $P+$. This space consists of multiple textual conditions, derived from per-layer prompts, each corresponding to a layer of the denoising U-net of the diffusion model. We show that the extended space provides greater disentangling and control over image synthesis. We further introduce Extended Textual Inversion (XTI), where the images are inverted into $P+$, and represented by per-layer tokens. We show that XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space. The extended inversion method does not involve any noticeable trade-off between reconstruction and editability and induces more regular inversions. We conduct a series of extensive experiments to analyze and understand the properties of the new space, and to showcase the effectiveness of our method for personalizing text-to-image models. Furthermore, we utilize the unique properties of this space to achieve previously unattainable results in object-style mixing using text-to-image models. Project page: https://prompt-plus.github.io
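The difference between a single shared textual condition and the per-layer conditions of $P+$ can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_text_encoder`, `condition_p`, `condition_p_plus`, the embedding size, and the layer count of 4 are all hypothetical stand-ins chosen for illustration, and a real system would use a learned text encoder and the cross-attention layers of the denoising U-net.

```python
from typing import List, Sequence

Embedding = List[float]


def toy_text_encoder(prompt: str, dim: int = 4) -> Embedding:
    """Hypothetical stand-in for a text encoder: a deterministic toy vector."""
    total = sum(ord(ch) for ch in prompt)
    return [float((total * (i + 1)) % 11) for i in range(dim)]


def condition_p(prompt: str, num_layers: int) -> List[Embedding]:
    """Single-prompt conditioning: every layer receives the same condition."""
    embedding = toy_text_encoder(prompt)
    return [embedding for _ in range(num_layers)]


def condition_p_plus(prompts: Sequence[str]) -> List[Embedding]:
    """Per-layer conditioning: one prompt, and so one condition, per layer."""
    return [toy_text_encoder(p) for p in prompts]


# Assumed layer count, for illustration only.
NUM_LAYERS = 4
shared = condition_p("a cat", NUM_LAYERS)
per_layer = condition_p_plus(["a cat", "a cat", "a dog", "a dog"])
```

In this sketch the single-prompt case is the special case of per-layer conditioning in which all layers are given the same prompt. Per-layer conditioning additionally allows different layers to receive different prompts, which is the kind of freedom that object-style mixing relies on.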
