VidToMe: Video Token Merging for Zero-Shot Video Editing

Abstract

Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods.
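
To make the core idea concrete, the following is a minimal sketch of cross-frame token merging, not the paper's implementation. It assumes tokens are plain lists of floats, that correspondence is approximated by cosine similarity, and that merged tokens are averaged. The names `merge_tokens` and `unmerge_tokens` and the parameter `r` (number of tokens to merge) are hypothetical, and the chunking with local and global merging described above is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(src, dst, r):
    """Merge the r source-frame tokens that best match a target-frame token.

    Returns the compressed token list (target tokens first, then the
    unmerged source tokens) plus the bookkeeping needed to unmerge.
    """
    # For every source token, find its most similar target token.
    matches = []
    for i, s in enumerate(src):
        j = max(range(len(dst)), key=lambda k: cosine(s, dst[k]))
        matches.append((cosine(s, dst[j]), i, j))
    matches.sort(reverse=True)
    merged_pairs = {i: j for _, i, j in matches[:r]}

    # Average each target token with the source tokens merged into it.
    groups = {j: [d] for j, d in enumerate(dst)}
    for i, j in merged_pairs.items():
        groups[j].append(src[i])
    new_dst = [[sum(c) / len(groups[j]) for c in zip(*groups[j])]
               for j in range(len(dst))]

    kept = [i for i in range(len(src)) if i not in merged_pairs]
    tokens = new_dst + [src[i] for i in kept]
    return tokens, merged_pairs, kept

def unmerge_tokens(tokens, merged_pairs, kept, n_src, n_dst):
    """Restore per-frame token lists; merged source slots copy their target."""
    dst = tokens[:n_dst]
    src = [None] * n_src
    for pos, i in enumerate(kept):
        src[i] = tokens[n_dst + pos]
    for i, j in merged_pairs.items():
        src[i] = dst[j]
    return src, dst

# Two toy frames with two 2-D tokens each; merge one redundant token.
dst_frame = [[1.0, 0.0], [0.0, 1.0]]
src_frame = [[2.0, 0.0], [1.0, 1.0]]
tokens, pairs, kept = merge_tokens(src_frame, dst_frame, r=1)
print(len(tokens))  # 3 tokens enter attention instead of 4
```

In this sketch, attention would run on the shorter `tokens` list, and `unmerge_tokens` would then copy each shared output back to every frame position that was merged into it, which is how sharing tokens across frames can encourage consistent content while reducing the attention cost.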
