Magic-Me: Identity-Specific Video Customized Diffusion

Abstract

Creating content with specified identities (ID) has attracted significant interest in the field of generative models. In text-to-image (T2I) generation, subject-driven creation has achieved great progress, with the identity controlled via reference images. However, its extension to video generation is not well explored. In this work, we propose a simple yet effective subject-identity-controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation at the initialization stage for stable video outputs. To achieve this, we propose three novel components that are essential for high-quality identity preservation and stable video generation: 1) a noise initialization method with a 3D Gaussian Noise Prior for better inter-frame stability; 2) an ID module based on extended Textual Inversion, trained with the cropped identity to disentangle the ID information from the background; 3) Face VCD and Tiled VCD modules to reinforce faces and upscale the video to higher resolution while preserving the identity's features. We conducted extensive experiments to verify that VCD generates stable videos with better identity preservation than the baselines. In addition, thanks to the transferability of the encoded identity in the ID module, VCD also works well with publicly available personalized text-to-image models. The code is available at https://github.com/Zhen-Dong/Magic-Me.
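
To make the idea of injecting frame-wise correlation at initialization more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the form of the 3D Gaussian Noise Prior, so the sketch assumes a simple first-order autoregressive construction in which the noise at the same position in frames i and j has correlation gamma^|i-j| while each value stays marginally standard normal. The function names and the parameter `gamma` are hypothetical.

```python
import math
import random


def correlated_frame_noise(num_frames, frame_size, gamma, seed=0):
    """Return num_frames noise vectors, each of length frame_size.

    Assumption (not taken from the paper): noise is correlated along the
    time axis as an AR(1) process, so each value is marginally N(0, 1) and
    the same position in frames i and j has correlation gamma ** abs(i - j).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    rng = random.Random(seed)
    # Scaling the fresh noise by sqrt(1 - gamma^2) keeps unit variance.
    scale = math.sqrt(1.0 - gamma * gamma)
    frames = [[rng.gauss(0.0, 1.0) for _ in range(frame_size)]]
    for _ in range(1, num_frames):
        prev = frames[-1]
        frames.append([gamma * p + scale * rng.gauss(0.0, 1.0) for p in prev])
    return frames


def correlation(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

With `gamma = 0` the frames are independent, as in standard diffusion initialization; values closer to 1 make neighboring frames start from increasingly similar noise, which is the intuition behind trading some motion diversity for inter-frame stability.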
