Target-Aware Video Diffusion Models

Abstract

We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.
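
The cross-attention loss described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything below is an assumption made for illustration: the mean squared error, the min-max normalization, and the names `cross_attention_mask_loss`, `selective_loss`, and `selected_blocks` are hypothetical and are not taken from the paper. The sketch scores how well the attention map of the special target token matches the binary target mask, then averages that score over a chosen subset of transformer blocks.

```python
def cross_attention_mask_loss(attn_map, target_mask):
    """Mean squared error between a normalized attention map and a binary mask.

    attn_map:    2D list of floats, attention of the special target token
                 over spatial positions (hypothetical layout).
    target_mask: 2D list of 0/1 values with the same shape.
    The MSE form is an assumption; the abstract only states that the loss
    aligns the attention maps with the mask.
    """
    flat_attn = [a for row in attn_map for a in row]
    flat_mask = [float(m) for row in target_mask for m in row]
    if len(flat_attn) != len(flat_mask):
        raise ValueError("attention map and mask must have the same size")

    # Min-max normalize so attention values are comparable with a 0/1 mask.
    lo, hi = min(flat_attn), max(flat_attn)
    scale = hi - lo
    if scale == 0:
        normalized = [0.0] * len(flat_attn)  # uniform attention carries no spatial signal
    else:
        normalized = [(a - lo) / scale for a in flat_attn]

    return sum((n - m) ** 2 for n, m in zip(normalized, flat_mask)) / len(flat_mask)


def selective_loss(attn_maps_per_block, target_mask, selected_blocks):
    """Average the loss over a chosen subset of transformer blocks.

    attn_maps_per_block: dict mapping a block index to its attention map.
    selected_blocks:     indices of the blocks to supervise (assumed to be
                         chosen beforehand; the selection rule is not modeled here).
    """
    losses = [
        cross_attention_mask_loss(attn_maps_per_block[b], target_mask)
        for b in selected_blocks
    ]
    return sum(losses) / len(losses)


# Toy 2x2 example: the target occupies the top-right cell.
mask = [[0, 1], [0, 0]]
aligned = [[0.0, 1.0], [0.0, 0.0]]     # attention concentrated on the target
misaligned = [[1.0, 0.0], [0.0, 0.0]]  # attention concentrated elsewhere
blocks = {0: aligned, 1: misaligned}
```

In this toy setup the aligned map yields zero loss and the misaligned map a positive one, so minimizing the loss would push the token's attention toward the masked region. Restricting `selected_blocks` changes which blocks contribute to the average.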
