StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation

Abstract

Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live-streaming products, but their image-based design limits temporal consistency. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. In addition, scalable multi-GPU serving for real-time streams remains largely unresolved. To address these challenges, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token-guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1-4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5 s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs, making state-of-the-art generative live streaming practical and accessible, from individual creators to enterprise-scale platforms.
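
To make the idea of a rolling cache with sink tokens concrete, the following is a minimal sketch of the general technique, not the StreamDiffusionV2 implementation. It assumes the simplest possible policy: the first few entries (the "sinks") are kept permanently, and all later entries live in a fixed-size window that evicts the oldest one. The class name `SinkRollingCache` and the parameters `num_sink` and `window` are hypothetical, and integer frame ids stand in for real key/value tensors.

```python
from collections import deque


class SinkRollingCache:
    """Toy cache: the first `num_sink` entries are kept forever (sinks),
    later entries live in a rolling window of size `window`.
    Hypothetical illustration only, not the paper's data structure."""

    def __init__(self, num_sink: int, window: int):
        if num_sink < 0 or window <= 0:
            raise ValueError("num_sink must be >= 0 and window must be > 0")
        self.num_sink = num_sink
        self.sink = []                      # permanent entries
        self.recent = deque(maxlen=window)  # rolling entries

    def append(self, entry):
        # Fill the sink slots first; afterwards push into the rolling window.
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)  # a full deque drops its oldest entry

    def contents(self):
        # Sinks come first, followed by the most recent entries in order.
        return self.sink + list(self.recent)


cache = SinkRollingCache(num_sink=2, window=3)
for frame_id in range(8):
    cache.append(frame_id)
print(cache.contents())  # [0, 1, 5, 6, 7]
```

In this sketch memory stays bounded at `num_sink + window` entries regardless of stream length, which is the property a long-running stream needs. How the actual system selects or updates its sink tokens is not specified in the abstract.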
