Abstract
Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, and further results in failures on meticulous visual problems and in unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems with a series of manipulations, where each manipulation refers to an operation on the visual input, either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning, and permits users to trace error causes in the interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with the data swiftly yields competitive performance. The code and data are publicly available at https://github.com/THUDM/CogCoM.