Abstract
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling the vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to that of closed-source models such as GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.