UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

  • 2025-10-21 14:56:46
  • Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Abstract

Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) it comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
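
As a rough illustration of the hierarchical evaluation described above, the following sketch shows one way per-testpoint judgments could be rolled up into sub-dimension and primary-dimension scores. The abstract does not specify the scoring rule, so everything here is an assumption: the `PointResult` record, the `aggregate` function, the pass/fail judgment, the unweighted averaging, and the dimension labels are all hypothetical and are not taken from the benchmark itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PointResult:
    """One evaluator judgment for one testpoint (hypothetical schema)."""
    prompt_id: int
    primary: str   # primary evaluation dimension (label is illustrative)
    sub: str       # sub-dimension under that primary dimension
    passed: bool   # assumed binary verdict from the evaluator


def aggregate(results):
    """Return (sub_acc, primary_acc).

    Assumption: a sub-dimension score is the fraction of its testpoints
    that passed, and a primary score is the unweighted mean of the scores
    of its sub-dimensions. The paper may use a different rule.
    """
    verdicts = defaultdict(list)
    for r in results:
        verdicts[(r.primary, r.sub)].append(r.passed)

    sub_acc = {key: sum(v) / len(v) for key, v in verdicts.items()}

    by_primary = defaultdict(list)
    for (primary, _sub), acc in sub_acc.items():
        by_primary[primary].append(acc)

    primary_acc = {p: sum(a) / len(a) for p, a in by_primary.items()}
    return sub_acc, primary_acc


# Toy data with made-up dimension names, only to exercise the function.
demo = [
    PointResult(1, "attribute", "color", True),
    PointResult(2, "attribute", "color", False),
    PointResult(2, "attribute", "shape", True),
]
sub_scores, primary_scores = aggregate(demo)
```

Averaging over sub-dimensions rather than over raw testpoints keeps a sub-dimension with few testpoints from being drowned out by a larger one; whether the benchmark makes the same choice is not stated in the abstract.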

 
