ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

Abstract

Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mixes attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which are crucial for training a model that represents and decouples various concepts well. To address these challenges, we introduce ConceptMaster, an innovative framework that effectively tackles the critical issues of identity decoupling while maintaining concept fidelity in customized videos. Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To further overcome the scarcity of high-quality MCVC data, we carefully establish a data construction pipeline, which enables systematic collection of precise multi-concept video-entity data across diverse concepts. A comprehensive benchmark is designed to validate the effectiveness of our model from three critical dimensions: concept fidelity, identity decoupling ability, and video generation quality across six different concept composition scenarios. Extensive experiments demonstrate that our ConceptMaster significantly outperforms previous approaches for this task, paving the way for generating personalized and semantically accurate videos across multiple concepts.
