Abstract
Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared with UNet, Diffusion Transformers (DiT) capture long-range dependencies among patches more effectively, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, reducing the number of steps compared with the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patches merging, both tailored for transformer computation streams. This integration allows our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially for high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit across various editing scenarios, highlighting the potential of Diffusion Transformers in supporting image editing.