RepVideo: Rethinking Cross-Layer Representation for Video Generation

Abstract

Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
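
The two ideas in the abstract, accumulating features from neighboring layers and measuring similarity between adjacent frames, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the aggregation operator, so a plain mean over a sliding window of layers is assumed here, and cosine similarity is assumed as the frame-similarity measure. The function names and the `window` parameter are hypothetical, and features are represented as flat lists of floats purely for illustration.

```python
import math


def aggregate_neighbor_layers(layer_feats, window):
    """Average each layer's features with up to `window - 1` preceding layers.

    layer_feats: list of layers, each a flat list of floats.
    The mean is an assumed aggregation operator, chosen for simplicity.
    """
    aggregated = []
    for i in range(len(layer_feats)):
        # Neighboring layers: the current layer plus the ones just before it.
        group = layer_feats[max(0, i - window + 1): i + 1]
        aggregated.append([sum(vals) / len(group) for vals in zip(*group)])
    return aggregated


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def adjacent_frame_similarity(frame_feats):
    """Similarity between each pair of consecutive frames (assumed metric)."""
    return [
        cosine_similarity(frame_feats[t], frame_feats[t + 1])
        for t in range(len(frame_feats) - 1)
    ]


# Toy usage: three layers with two feature values each, window of two layers.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
enriched = aggregate_neighbor_layers(layers, window=2)

# Toy usage: three frames; the last frame is orthogonal to the second.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
similarities = adjacent_frame_similarity(frames)
```

In this toy setting, averaging over a window smooths layer-to-layer variation, and the per-pair similarity scores give a simple proxy for the temporal coherence the abstract refers to.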
