Towards Language-Driven Video Inpainting via Multimodal Large Language Models

  • 2024-10-01 06:58:37
  • Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu, Chen Change Loy

Abstract

We introduce a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available.

 
