FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling

Abstract

With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, which leaves them unable to generate high-fidelity long videos during inference. Furthermore, these models only support single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Then, building upon this observation, we propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them with a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. It is noteworthy that, compared with the previous best-performing method, which brought about 255% extra time cost, our method incurs only a negligible extra time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
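
The noise rescheduling idea can be illustrated with a minimal sketch. The abstract does not specify the exact schedule, so everything below is an assumption made for illustration: the function name `reschedule_noise`, its parameters, and the choice to reuse a base set of independent noise frames and shuffle them within local windows. It is one plausible reading of "reschedule a sequence of noises for long-range correlation", not the authors' implementation, and it does not cover the window-based temporal attention step.

```python
import numpy as np


def reschedule_noise(base_frames, total_frames, shape, window=4, seed=0):
    """Return initial noise for `total_frames` frames.

    Hypothetical sketch: only `base_frames` noise frames are sampled
    independently. Later frames reuse them, shuffled inside local windows
    of size `window`, so distant frames share noise content.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((base_frames, *shape))

    chunks = [base]
    produced = base_frames
    while produced < total_frames:
        idx = np.arange(base_frames)
        # Permute indices inside each local window (assumed scheme):
        # the order changes locally, the noise content is reused.
        for start in range(0, base_frames, window):
            idx[start:start + window] = rng.permutation(idx[start:start + window])
        chunks.append(base[idx])
        produced += base_frames

    # Trim the last chunk so exactly `total_frames` frames are returned.
    return np.concatenate(chunks, axis=0)[:total_frames]


# Example: extend 16 independently sampled frames to a 40-frame sequence.
noise = reschedule_noise(base_frames=16, total_frames=40, shape=(2, 2))
print(noise.shape)
```

In this sketch every frame beyond the first 16 is a copy of one of the base noise frames, which is what introduces correlation across distant frames. The window size controls how far a reused frame can move from its original position.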
