Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Abstract

When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical 'whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
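
The loop described in the abstract (the model writes drawing code, the code is rendered to an image, and the image is returned to the model) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `render_matplotlib_code` and `whiteboard_of_thought`, and the `write_code` and `read_image` callables that stand in for calls to a multimodal model, are hypothetical and do not come from the paper.

```python
import io
from typing import Callable


def render_matplotlib_code(code: str) -> bytes:
    """Execute Matplotlib drawing code and return the current figure as PNG bytes.

    Assumption: the code draws through a `plt` name that we supply.
    Matplotlib is imported lazily so the loop below can be used without it.
    """
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    plt.close("all")
    # WARNING: model-written code is untrusted; sandbox it in real use.
    exec(code, {"plt": plt})
    buffer = io.BytesIO()
    plt.savefig(buffer, format="png")
    plt.close("all")
    return buffer.getvalue()


def whiteboard_of_thought(
    query: str,
    write_code: Callable[[str], str],          # hypothetical: model writes drawing code
    read_image: Callable[[str, bytes], str],   # hypothetical: model answers from the image
    render: Callable[[str], bytes] = render_matplotlib_code,
) -> str:
    """One pass of the loop: query -> drawing code -> image -> answer."""
    code = write_code(query)          # step 1: model produces visualization code
    image = render(code)              # step 2: code is executed to produce an image
    return read_image(query, image)   # step 3: image goes back to the model
```

Passing the model calls in as plain callables keeps the sketch independent of any particular model API, which the abstract does not specify.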
