Abstract
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning--a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits--low entity fidelity, weak relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
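
To make the idea of scoring by graph-edit distance between KGs concrete, the following is a minimal illustrative sketch and not the benchmark's actual implementation. It assumes that entities are uniquely named, so the edit distance reduces to counting the entities and relation triples that must be inserted or deleted. It also assumes a clarity value in [0, 1] and a multiplicative combination; the abstract does not specify the normalization or the combination rule. The names `KnowledgeGraph`, `factual_fidelity`, and `combined_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeGraph:
    """Hypothetical KG container: named entities plus (head, label, tail) triples."""
    entities: frozenset
    relations: frozenset


def graph_edit_distance(ref: KnowledgeGraph, gen: KnowledgeGraph) -> int:
    # With uniquely named entities and unit costs for insert/delete only,
    # the edit distance is the size of the symmetric differences.
    return len(ref.entities ^ gen.entities) + len(ref.relations ^ gen.relations)


def factual_fidelity(ref: KnowledgeGraph, gen: KnowledgeGraph) -> float:
    # Assumed normalization: divide by the total number of elements in both
    # graphs, so identical graphs score 1.0 and disjoint graphs score 0.0.
    total = (len(ref.entities) + len(ref.relations)
             + len(gen.entities) + len(gen.relations))
    if total == 0:
        return 1.0
    return 1.0 - graph_edit_distance(ref, gen) / total


def combined_score(ref: KnowledgeGraph, gen: KnowledgeGraph, clarity: float) -> float:
    # Assumed combination: fidelity weighted by a clarity value in [0, 1],
    # reported on a 0-100 scale.
    if not 0.0 <= clarity <= 1.0:
        raise ValueError("clarity must lie in [0, 1]")
    return 100.0 * factual_fidelity(ref, gen) * clarity


reference = KnowledgeGraph(
    entities=frozenset({"sun", "earth", "moon"}),
    relations=frozenset({("earth", "orbits", "sun"), ("moon", "orbits", "earth")}),
)
generated = KnowledgeGraph(
    entities=frozenset({"sun", "earth"}),
    relations=frozenset({("earth", "orbits", "sun")}),
)

print(graph_edit_distance(reference, generated))  # 2: one entity and one relation missing
print(combined_score(reference, generated, clarity=0.5))  # 37.5
```

In this toy example the generated graph omits one entity and one relation, giving a fidelity of 0.75 under the assumed normalization; a blurry or cluttered rendering (low clarity) would further reduce the combined score.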