Abstract
We introduce InVi, an approach for inserting or replacing objects within videos (referred to as inpainting) using off-the-shelf, text-to-image latent diffusion models. InVi targets controlled manipulation of objects and blending them seamlessly into a background video unlike existing video editing methods that focus on comprehensive re-styling or entire scene alterations. To achieve this goal, we tackle two key challenges. Firstly, for high quality control and blending, we employ a two-step process involving inpainting and matching. This process begins with inserting the object into a single frame using a ControlNet-based inpainting diffusion model, and then generating subsequent frames conditioned on features from an inpainted frame as an anchor to minimize the domain gap between the background and the object. Secondly, to ensure temporal coherence, we replace the diffusion model's self-attention layers with extended-attention layers. The anchor frame features serve as the keys and values for these layers, enhancing consistency across frames. Our approach removes the need for video-specific fine-tuning, presenting an efficient and adaptable solution. Experimental results demonstrate that InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.
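The extended-attention idea described above can be illustrated with a minimal sketch. It is not the InVi implementation: the function names (`softmax`, `extended_attention`), the plain-list tensor layout, and the choice to attend to the anchor frame only (rather than to the anchor together with the current frame) are assumptions made for illustration. The sketch shows only the core substitution, in which queries come from the frame being generated while keys and values come from the anchor frame.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extended_attention(queries, anchor_keys, anchor_values):
    """Scaled dot-product attention with keys/values taken from an anchor frame.

    queries:       token vectors from the frame being generated
    anchor_keys:   token vectors from the anchor (inpainted) frame
    anchor_values: token vectors from the anchor frame, one per key

    Hypothetical helper for illustration; a real layer would also apply
    learned projections and operate on batched, multi-head tensors.
    """
    key_dim = len(anchor_keys[0])
    value_dim = len(anchor_values[0])
    scale = 1.0 / math.sqrt(key_dim)

    outputs = []
    for q in queries:
        # Similarity of this query token to every anchor token.
        scores = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) for k in anchor_keys
        ]
        weights = softmax(scores)
        # Weighted sum of anchor values: the output is built from anchor features.
        out = [
            sum(w * v[j] for w, v in zip(weights, anchor_values))
            for j in range(value_dim)
        ]
        outputs.append(out)
    return outputs


# Toy usage: one query token, two anchor tokens with 2-D features.
frame_queries = [[10.0, 0.0]]
anchor_k = [[1.0, 0.0], [0.0, 1.0]]
anchor_v = [[1.0, 0.0], [0.0, 1.0]]
attended = extended_attention(frame_queries, anchor_k, anchor_v)
```

Because every generated frame draws its keys and values from the same anchor, tokens in different frames that issue similar queries receive similar outputs, which is the intuition behind the consistency claim.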