Abstract
Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias toward low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE, a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. Extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, with improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, as assessed by the DIVE metric in comparison to existing methods.