Lazy Diffusion Transformer for Interactive Image Editing

Abstract

We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
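
To make the relation between mask size and decoder work concrete, the following is a minimal sketch that counts how many patch tokens a binary mask touches and compares that to the full patch grid. It is not the LazyDiffusion implementation: the helper names (`masked_patch_indices`, `token_ratio`), the 40x40 canvas, and the patch size of 4 are hypothetical, and the sketch assumes decoder cost grows linearly with the number of generated tokens, which ignores attention's superlinear cost and the encoder overhead.

```python
def masked_patch_indices(mask, patch):
    """Return (row, col) grid indices of patches with at least one masked pixel.

    `mask` is a list of rows of 0/1 values; `patch` is the patch side in pixels.
    Both the representation and the patch size are illustrative assumptions.
    """
    h, w = len(mask), len(mask[0])
    hits = []
    for pr in range(0, h, patch):
        for pc in range(0, w, patch):
            rows = range(pr, min(pr + patch, h))
            cols = range(pc, min(pc + patch, w))
            if any(mask[r][c] for r in rows for c in cols):
                hits.append((pr // patch, pc // patch))
    return hits


def token_ratio(mask, patch):
    """Ratio of total patch tokens to masked patch tokens.

    Under the simplifying assumption of linear cost per token, this is a rough
    proxy for how much less work a mask-only decoder does than a full-canvas one.
    """
    h, w = len(mask), len(mask[0])
    total = ((h + patch - 1) // patch) * ((w + patch - 1) // patch)
    masked = len(masked_patch_indices(mask, patch))
    if masked == 0:
        raise ValueError("mask is empty: nothing to generate")
    return total / masked


# Hypothetical 40x40 canvas with an 8x20 masked rectangle (10% of the pixels).
mask = [[1 if r < 8 and c < 20 else 0 for c in range(40)] for r in range(40)]
ratio = token_ratio(mask, patch=4)
print(ratio)  # prints 10.0
```

In this toy setting a mask covering 10% of the canvas touches 10 of 100 patches, so a decoder that only generates masked tokens handles one tenth of the tokens. The measured 10x speedup reported above is the authors' result; the sketch only illustrates the token-count arithmetic behind it.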
