Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

Abstract

Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this, either because they have limited capacity for modeling visual state transitions or because fragmented architectures produce incoherent visual trajectories. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to do this is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multimodal reasoning. Thanks to this design, all experiments can be completed efficiently using only 8 A100 GPUs with 80GB VRAM each. Experimental results on the reasoning-driven image generation benchmark (WISE) and the editing benchmarks (RISE and KRIS) indicate that Uni-CoT achieves SOTA performance and strong generalization, establishing it as a promising solution for multimodal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
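
The two-level paradigm described in the abstract can be pictured as a planner that splits a task into subtasks and an executor that handles each subtask on its own. The following is a minimal control-flow sketch only, not the Uni-CoT implementation: the names `Subtask`, `run_two_level`, `toy_plan`, and `toy_execute` are hypothetical, and the toy functions are string-based stand-ins for a unified model that would plan and generate over interleaved text and images.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtask:
    """One unit of work produced by the macro-level plan (hypothetical type)."""
    instruction: str
    result: str = ""


def run_two_level(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], str],
) -> List[Subtask]:
    """Macro level: call `plan` once to decompose the task.
    Micro level: call `execute` on each subtask instruction separately."""
    subtasks = [Subtask(instruction=s) for s in plan(task)]
    for subtask in subtasks:
        # Assumption of this sketch: each micro-level call receives only
        # its own instruction, not the full reasoning history.
        subtask.result = execute(subtask.instruction)
    return subtasks


# Toy stand-ins; a real system would query a multimodal model here.
def toy_plan(task: str) -> List[str]:
    return [f"{task} / step {i}" for i in (1, 2, 3)]


def toy_execute(instruction: str) -> str:
    return f"done({instruction})"


if __name__ == "__main__":
    for st in run_two_level("edit image", toy_plan, toy_execute):
        print(st.instruction, "->", st.result)
```

In this sketch the micro-level executor never sees the other subtasks, which is one plausible reading of how a planning/execution split could keep per-step context small; the abstract itself does not specify the mechanism.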
