Abstract
Current video generation models excel at creating short, realistic clips, but struggle with longer, multi-scene videos. We introduce \texttt{DreamFactory}, an LLM-based framework that tackles this challenge. \texttt{DreamFactory} leverages multi-agent collaboration principles and a Key Frames Iteration Design Method to ensure consistency and style across long videos. It utilizes Chain of Thought (COT) to address uncertainties inherent in large language models. \texttt{DreamFactory} generates long, stylistically coherent, and complex videos. Evaluating these long-form videos presents a challenge. We propose novel metrics such as Cross-Scene Face Distance Score and Cross-Scene Style Consistency Score. To further research in this area, we contribute the Multi-Scene Videos Dataset containing over 150 human-rated videos.