Abstract
Although text-to-video (TTV) models have recently achieved remarkable success, few approaches extend TTV to video editing. Motivated by TTV approaches that adapt diffusion-based text-to-image (TTI) models, we propose a video editing framework that requires only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection. The two stages enable temporal modeling and the preservation of semantic attributes of the source video, respectively. One of the key challenges in video editing is the background inconsistency problem, where regions not targeted by the edit suffer from undesirable and inconsistent temporal alterations. To mitigate this issue, we also introduce a novel mask blending method, termed sparse-causal blending (SC Blending). We improve previous mask blending methods to reflect temporal consistency, so that the area where the edit is applied exhibits smooth transitions while the unedited regions remain spatio-temporally consistent. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method over baselines in terms of background consistency, text alignment, and video editing quality.