Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

Abstract

Text-to-image (T2I) diffusion models achieve state-of-the-art results in image synthesis and editing. However, leveraging such pretrained models for video editing is considered a major challenge. Many existing works attempt to enforce temporal consistency in the edited video through explicit correspondence mechanisms, either in pixel space or between deep features. These methods, however, struggle with strong nonrigid motion. In this paper, we introduce a fundamentally different approach, which is based on the observation that spatiotemporal slices of natural videos exhibit similar characteristics to natural images. Thus, the same T2I diffusion model that is normally used only as a prior on video frames, can also serve as a strong prior for enhancing temporal consistency by applying it on spatiotemporal slices. Based on this observation, we present Slicedit, a method for text-based video editing that utilizes a pretrained T2I diffusion model to process both spatial and spatiotemporal slices. Our method generates videos that retain the structure and motion of the original video while adhering to the target text. Through extensive experiments, we demonstrate Slicedit's ability to edit a wide range of real-world videos, confirming its clear advantages compared to existing competing methods. Webpage: https://matankleiner.github.io/slicedit/
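
To make the notion of a spatiotemporal slice concrete, the following is a minimal sketch of how spatial and spatiotemporal slices could be cut from a video array. It is not the authors' implementation: the (T, H, W, C) array layout and the function names spatial_slice, xt_slice and yt_slice are assumptions introduced here for illustration only.

```python
import numpy as np

# Assumed layout (not from the paper): video[t, y, x, c] with shape (T, H, W, C).

def spatial_slice(video, t):
    """Ordinary frame at time index t, shape (H, W, C)."""
    return video[t]

def xt_slice(video, y):
    """Spatiotemporal slice at a fixed image row y, shape (T, W, C).
    Each row of the result is the same scanline at a different time."""
    return video[:, y, :]

def yt_slice(video, x):
    """Spatiotemporal slice at a fixed image column x, shape (T, H, C)."""
    return video[:, :, x]

# Toy video: 8 frames of 6x10 RGB pixels with random content.
video = np.random.default_rng(0).random((8, 6, 10, 3))

frame = spatial_slice(video, 0)   # a 2D image over (y, x)
row_over_time = xt_slice(video, 2)  # a 2D image over (t, x)
col_over_time = yt_slice(video, 4)  # a 2D image over (t, y)
```

Each of the three results is a 2D multi-channel array, so in principle any of them could be passed to a model that expects images. How Slicedit actually combines the two kinds of slices inside the diffusion model is described in the full paper, not in this sketch.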
