MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes

Abstract

A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. The benchmark is generated from a curated library of functions with rigorous ambiguity filtering. Our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA provides a challenging new tool for the research community to measure progress, expose model limitations, and guide the development of MLLMs with more profound reasoning abilities.
