Abstract
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, the clean background, and the blended scene. It also supports versatile variants, such as decomposing a blended video, or generating the background for a given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips and use layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we support all of the aforementioned variants in one unified framework. To compensate for the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model on low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train a content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thereby generating smooth videos with the desired layers.
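
As a rough illustration of the sub-clip organization described above, the following minimal sketch concatenates per-layer sub-clips along the time axis and tags every frame with a layer index, which then selects an embedding vector added to that frame's features. This is not the LayerFlow implementation: the layer ordering, the names `SubClip`, `layer_ids_for`, and `add_layer_embedding`, the example prompts, and the toy two-dimensional embedding table are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed layer order; the abstract names the three layers but not an ordering.
LAYERS = ("foreground", "background", "blend")


@dataclass
class SubClip:
    layer: str       # one of LAYERS
    prompt: str      # per-layer text prompt (illustrative examples below)
    num_frames: int  # number of frames in this sub-clip


def layer_ids_for(clips):
    """Return one layer index per frame after concatenating sub-clips in time."""
    ids = []
    for clip in clips:
        ids.extend([LAYERS.index(clip.layer)] * clip.num_frames)
    return ids


def add_layer_embedding(features, layer_ids, table):
    """Add each frame's layer embedding vector to its feature vector."""
    return [
        [x + e for x, e in zip(feat, table[lid])]
        for feat, lid in zip(features, layer_ids)
    ]


# Hypothetical prompts, one per layer.
clips = [
    SubClip("foreground", "a corgi running, transparent background", 2),
    SubClip("background", "an empty beach at sunset", 2),
    SubClip("blend", "a corgi running on a beach at sunset", 2),
]
layer_ids = layer_ids_for(clips)

# Toy embedding table (one vector per layer); a real model would learn these.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.0, 0.0] for _ in layer_ids]  # placeholder per-frame features
tagged = add_layer_embedding(features, layer_ids, table)
```

Under these assumptions, the same layer index could also be attached to the tokens of each layer's prompt, so that a clip and its prompt share one embedding; the abstract states that layer embeddings distinguish both, but does not specify the mechanism.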