Abstract
Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can direct objects in the image to leave the scene naturally, or provide brand-new identity references that enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a new dataset curated semi-automatically, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.