Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Abstract

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
