Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
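
As a rough illustration of the preference-alignment technique named above, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not taken from the released code: the function name `dpo_loss`, its parameter names, and the default `beta` are hypothetical, and the inputs are assumed to be sequence-level log-probabilities (for an autoregressive image generator, presumably summed over the image tokens) under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names here are illustrative assumptions. Each argument is the total
    log-probability of a sample under the policy or the reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a margin of zero, so the loss
# equals log(2); the loss falls as the policy favors the chosen sample.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0)
```

In practice this quantity would be averaged over a batch of preference pairs and minimized by gradient descent; how the pairs are ranked (for example, by a reward model) is outside the scope of this sketch.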
