Abstract
In this paper, we present MovieFactory, a powerful framework for generating cinematic-picture (3072$\times$1280), film-style (multi-scene), and multi-modality (sounding) movies from natural language input. To the best of our knowledge, it is the first fully automated movie generation model: our approach empowers users to create captivating movies with smooth transitions using simple text inputs, surpassing existing methods that produce soundless videos limited to a single scene of modest quality. To facilitate this distinctive functionality, we leverage ChatGPT to expand user-provided text into detailed sequential scripts for movie generation. We then bring the scripts to life visually and acoustically through vision generation and audio retrieval. To generate videos, we extend the capabilities of a pretrained text-to-image diffusion model through a two-stage process. First, we employ spatial finetuning to bridge the gap between the pretrained image model and the new video dataset. Subsequently, we introduce temporal learning to capture object motion. For audio, we leverage sophisticated retrieval models to select and align audio elements that correspond to the plot and visual content of the movie. Extensive experiments demonstrate that MovieFactory produces movies with realistic visuals, diverse scenes, and seamlessly fitting audio, offering users a novel and immersive experience. Generated samples can be found on YouTube or Bilibili (1080P).