Abstract
Recent breakthroughs in OpenAI's GPT-4o model have demonstrated surprisingly good capabilities in image generation and editing, generating significant excitement in the community. This technical report presents a first-look evaluation benchmark, GPT-ImgEval, that quantitatively and qualitatively diagnoses GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o; our empirical results suggest that the model consists of an auto-regressive (AR) model combined with a diffusion-based head for image decoding, rather than a VAR-like architecture. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The code and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.