Abstract
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.
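
As a rough illustration of what attending to all subject tokens could mean in practice, the sketch below scores how strongly the most neglected subject is activated in a single cross-attention map. Everything in it is an assumption made for illustration and is not taken from the abstract: the function name `neglect_loss`, the layout of the attention matrix (one row per image patch, one column per prompt token), and the specific choice of loss (one minus the peak attention of the weakest subject token).

```python
from typing import Sequence


def neglect_loss(
    attention: Sequence[Sequence[float]],
    subject_tokens: Sequence[int],
) -> float:
    """Hypothetical loss: large when some subject token is barely attended to.

    attention[p][t] is the cross-attention weight that image patch p
    assigns to prompt token t (assumed layout, rows sum to 1).
    """
    # Strongest activation of each subject token over all patches.
    peaks = [max(row[t] for row in attention) for t in subject_tokens]
    # The loss is driven by the subject with the weakest peak activation.
    return max(1.0 - peak for peak in peaks)


# Toy map for the prompt "a cat and a frog": 3 patches x 5 tokens.
# Token indices: 0="a", 1="cat", 2="and", 3="a", 4="frog".
attention = [
    [0.10, 0.70, 0.10, 0.05, 0.05],
    [0.20, 0.50, 0.10, 0.10, 0.10],
    [0.30, 0.30, 0.20, 0.10, 0.10],
]

# "cat" peaks at 0.70 while "frog" never exceeds 0.10, so "frog" dominates.
loss = neglect_loss(attention, subject_tokens=[1, 4])
```

In a scheme of this kind, an inference-time intervention would nudge the intermediate latent in the direction that lowers this loss at each denoising step, which raises the activation of the neglected subject. The toy map above has no such update step; it only shows how a neglected token ("frog") would be detected.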