AMD-Hummingbird: Towards an Efficient Text-to-Video Model

Abstract

Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
