GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Abstract

Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
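The two evaluation dimensions described above can be illustrated with a minimal sketch. This is not the benchmark's implementation: the abstract does not specify the preservation score, so mean absolute pixel difference outside the edit mask is used here purely as a stand-in, and the function names `functional_correctness` and `masked_preservation`, the grayscale nested-list image format, and the mask convention are all assumptions made for illustration.

```python
from typing import List, Sequence

Image = List[List[float]]  # assumed format: grayscale pixel values in [0, 1]
Mask = List[List[bool]]    # assumed convention: True marks the region targeted by the edit


def functional_correctness(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of multiple-choice questions answered with the expected option."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one question is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def masked_preservation(source: Image, edited: Image, edit_mask: Mask) -> float:
    """Return 1 minus the mean absolute difference over pixels outside the edit mask.

    Stand-in metric (an assumption): 1.0 means the non-targeted region is unchanged.
    """
    total = 0.0
    count = 0
    for src_row, out_row, mask_row in zip(source, edited, edit_mask):
        for src_px, out_px, is_target in zip(src_row, out_row, mask_row):
            if not is_target:  # only score pixels the edit was not supposed to touch
                total += abs(src_px - out_px)
                count += 1
    if count == 0:
        raise ValueError("the mask covers the whole image; nothing to score")
    return 1.0 - total / count
```

Under these assumptions, a model that changes only the masked object scores 1.0 on preservation regardless of how large the edit is, while changes that leak outside the mask lower the score. This mirrors the trade-off the abstract reports between instruction-following accuracy and over-modification of irrelevant regions.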
