UNIC: Unified In-Context Video Editing

Abstract

Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these issues, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate between editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and to support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
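
The in-context formulation described above can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the tiny token dimensions, the identity attention projections, and the way the bias is applied (a single vector added to every condition token) are all assumptions made for illustration. Task-aware RoPE is omitted because the abstract does not specify how it is constructed.

```python
import math

def build_sequence(source_tokens, noisy_tokens, cond_tokens, task_bias):
    """Concatenate the three token groups into one consecutive sequence.

    task_bias is a hypothetical per-task vector; adding it to every
    condition token is an assumed form of the "condition bias" idea.
    """
    biased_cond = [[x + b for x, b in zip(tok, task_bias)] for tok in cond_tokens]
    return source_tokens + noisy_tokens + biased_cond

def self_attention(tokens):
    """Single-head self-attention with identity projections (toy version).

    Every token attends to every other token, so source, noisy, and
    condition tokens are mixed jointly without any adapter module.
    """
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        output.append([sum(w * value[d] for w, value in zip(weights, tokens))
                       for d in range(dim)])
    return output

# Toy inputs: 2 source tokens, 2 noisy latent tokens, 1 condition token, dim 4.
source = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
noisy = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0]]
cond = [[0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.0, 0.0, 1.0]  # hypothetical bias for one editing task

sequence = build_sequence(source, noisy, cond, bias)
mixed = self_attention(sequence)
print(len(sequence), len(mixed))
```

In this sketch, switching tasks only changes the condition tokens and the bias vector; the attention routine itself is unchanged, which is the sense in which no task-specific adapter is needed.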
