I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

Abstract

Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models by either concatenating the image with noised video frames channel-wise before feeding them into the model or injecting the image embedding produced by pretrained image encoders into cross-attention modules. However, the former approach often necessitates altering the fundamental weights of pretrained T2V models, thus restricting the model's compatibility within the open-source communities and disrupting the model's prior knowledge. Meanwhile, the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome such limitations. I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter introduces only a few trainable parameters, significantly alleviating the training cost and ensuring compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior to balance the motion amplitude and the stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos. This performance, coupled with its agility and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications.
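
To make the cross-frame attention idea concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes that every frame's tokens act as queries against keys and values taken from the first (unnoised) frame, and that the result is added to the output of the frozen self-attention path. The names `attend`, `cross_frame_attention`, and `gate` are hypothetical; learned projections are omitted (treated as identity), and the scalar `gate` merely stands in for whatever trainable parameters the adapter introduces.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def cross_frame_attention(frames, gate=0.0):
    """frames: list of frames, each a list of token vectors.

    frames[0] is assumed to hold the tokens of the unnoised input image.
    `gate` is a hypothetical scalar: gate=0 leaves the frozen path untouched.
    """
    first = frames[0]
    result = []
    for frame in frames:
        # Stand-in for the frozen, pretrained spatial self-attention.
        self_out = attend(frame, frame, frame)
        # Queries from this frame; keys and values from the first frame.
        cross_out = attend(frame, first, first)
        result.append([
            [s + gate * c for s, c in zip(sv, cv)]
            for sv, cv in zip(self_out, cross_out)
        ])
    return result


frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # first frame (input image tokens)
    [[0.5, 0.5], [0.2, 0.8]],  # a later, noised frame
]
base = cross_frame_attention(frames, gate=0.0)
mixed = cross_frame_attention(frames, gate=1.0)
```

With `gate=0.0` the output equals the plain self-attention result, which illustrates how an added branch can leave the pretrained model's behavior unchanged at the start of training; a nonzero `gate` mixes first-frame information into every later frame.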
