Multi-subject Open-set Personalization in Video Generation

Abstract

Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
