Abstract
With the ability to scale training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have extended scaling to test time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem: sampling better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers that provide feedback and heuristic algorithms that guide the search process. Given a text prompt, we first explore an intuitive linear search strategy that increases the number of noise candidates at inference time. As full-step denoising of all frames simultaneously incurs heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF), which adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality. Project page: https://liuff19.github.io/Video-T1
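
The linear search strategy described above can be illustrated with a minimal best-of-N sketch: draw several initial Gaussian noise candidates, denoise each one fully, score the results with a verifier, and keep the highest-scoring video. The names `linear_search`, `toy_generate`, and `toy_verify` are hypothetical and not taken from the paper; the generator and verifier are toy stand-ins for a real video diffusion model and a real test-time verifier.

```python
import random
from typing import Callable, List, Optional, Tuple

Noise = List[float]
Video = List[float]  # toy stand-in for a decoded video


def linear_search(
    prompt: str,
    generate: Callable[[str, Noise], Video],
    verify: Callable[[str, Video], float],
    num_candidates: int,
    noise_dim: int = 8,
    seed: int = 0,
) -> Tuple[Optional[Video], float]:
    """Sample N initial noises, denoise each fully, keep the top-scoring video."""
    rng = random.Random(seed)
    best_video: Optional[Video] = None
    best_score = float("-inf")
    for _ in range(num_candidates):
        # Each candidate starts from an independent Gaussian noise sample.
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        video = generate(prompt, noise)   # full denoising pass (assumed)
        score = verify(prompt, video)     # verifier feedback (assumed scalar)
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score


def toy_generate(prompt: str, noise: Noise) -> Video:
    # Hypothetical "model": a fixed transform of the noise.
    return [0.5 * x for x in noise]


def toy_verify(prompt: str, video: Video) -> float:
    # Hypothetical "verifier": prefers videos close to the origin.
    return -sum(x * x for x in video)


# More candidates means more inference-time compute spent on the same prompt.
video, score = linear_search("a cat surfing", toy_generate, toy_verify, num_candidates=8)
```

In this sketch the cost grows linearly with `num_candidates`, since every candidate is denoised to completion before it is scored. That is the overhead a method that prunes branches during generation would aim to reduce.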