Abstract
Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.