Abstract
We introduce a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available.