ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation

Abstract

Image-to-video (I2V) generation aims to use the initial frame (alongside a text prompt) to create a video sequence. A grand challenge in I2V generation is to maintain visual consistency throughout the video: existing methods often struggle to preserve the integrity of the subject, background, and style from the first frame, as well as ensure a fluid and logical progression within the video narrative. To mitigate these issues, we propose ConsistI2V, a diffusion-based method to enhance visual consistency for I2V generation. Specifically, we introduce (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. These two approaches enable ConsistI2V to generate highly consistent videos. We also extend the proposed approaches to show their potential to improve consistency in auto-regressive long video generation and camera motion control. To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. Our automatic and human evaluation results demonstrate the superiority of ConsistI2V over existing methods.
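
The idea of initializing noise from the low-frequency band of the first frame can be illustrated with a small sketch. The abstract does not specify the filter, the transform, or the space in which it is applied, so everything below is an assumption made for illustration: a per-frame 2D FFT, an ideal (hard-cutoff) low-pass mask, a single-channel frame, and the hypothetical names `low_frequency_mask`, `init_noise_from_first_frame`, and `cutoff`. It is not the paper's implementation.

```python
import numpy as np


def low_frequency_mask(height, width, cutoff):
    """Ideal low-pass mask in the unshifted 2D FFT layout (assumed filter)."""
    fy = np.fft.fftfreq(height)[:, None]  # vertical frequency, cycles per pixel
    fx = np.fft.fftfreq(width)[None, :]   # horizontal frequency, cycles per pixel
    return (np.sqrt(fy ** 2 + fx ** 2) <= cutoff).astype(np.float64)


def init_noise_from_first_frame(first_frame, num_frames, cutoff=0.25, seed=0):
    """Build per-frame initial noise that shares the first frame's low frequencies.

    first_frame: 2D array (H, W), a single-channel frame or latent (assumption).
    Returns an array of shape (num_frames, H, W).
    """
    rng = np.random.default_rng(seed)
    height, width = first_frame.shape
    noise = rng.standard_normal((num_frames, height, width))

    mask = low_frequency_mask(height, width, cutoff)
    frame_freq = np.fft.fft2(first_frame)
    noise_freq = np.fft.fft2(noise, axes=(-2, -1))

    # Low band comes from the first frame, high band from Gaussian noise.
    mixed = frame_freq[None] * mask + noise_freq * (1.0 - mask)
    return np.fft.ifft2(mixed, axes=(-2, -1)).real
```

Every returned frame carries the same coarse layout as the first frame while its high-frequency content stays random, which is the intuition behind using the low-frequency band to encourage layout consistency. For example, `init_noise_from_first_frame(frame, num_frames=16)` returns 16 such noise maps for a 2D array `frame`.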
