Abstract
Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.