Abstract
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF