Abstract
We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.