Abstract
While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To address this limitation, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. Additionally, we construct an instruction dataset to facilitate LMMs in adapting to reasoning with VoCoT. By incorporating VoCoT into the prevalent open-source LMM architecture, we introduce VolCano. With only 7B parameters and limited input resolution, VolCano demonstrates excellent performance across various scenarios, surpassing SOTA models, including GPT-4V, in tasks requiring complex reasoning. Our code, data and model will be available at https://github.com/RupertLuo/VoCoT.