MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Abstract

Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with reflection mechanism demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and demonstrating the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
