Abstract
Image editing, driven by image diffusion models, has made remarkable progress in recent years. However, significant challenges remain: these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has advanced rapidly, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation. Visit our project page at https://rotsteinnoam.github.io/Frame2Frame.