Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they were two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM the foundational ability to generate genuine chain-of-thought (CoT) for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io