ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Abstract

Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
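
To illustrate the group-relative idea behind GRPO, here is a minimal sketch of how rewards for several images sampled from the same prompt might be turned into advantages. The function name, the `eps` constant, the use of the population standard deviation, and the example scores are assumptions made for illustration; the abstract does not specify these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero when all
    rewards in the group are equal).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four images from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Samples scored above the group mean receive positive advantages and those below receive negative ones, so no separate learned value function is needed to form a baseline.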
