Abstract
This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V$^2$Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.