GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Abstract

The recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) model combined with a diffusion-based head for image decoding, rather than a VAR-like architecture. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The code and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.
