LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Abstract

Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large language models (LLMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges across eight categories, ranging from complex visual perception to scientific reasoning, with over 4k reasoning steps in total, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance than traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
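To make the contrast between step-level assessment and end-task accuracy concrete, the following is a minimal sketch and not the metric proposed in the paper. It assumes that each reasoning step has already been judged for correctness and logical coherence; the `StepJudgement` type, its field names, and the simple averaging rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepJudgement:
    """Hypothetical per-step verdict; the field names are assumptions."""
    correct: bool   # is the step right given the image and the question?
    coherent: bool  # does the step follow logically from the earlier steps?


def final_answer_accuracy(predicted: str, reference: str) -> float:
    """End-task accuracy for one sample: 1.0 on a case-insensitive match."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def step_level_score(steps: list[StepJudgement]) -> float:
    """Fraction of steps that are both correct and coherent (illustrative)."""
    if not steps:
        return 0.0
    good = sum(1 for s in steps if s.correct and s.coherent)
    return good / len(steps)


# A four-step chain that reaches the right answer through two flawed steps.
chain = [
    StepJudgement(correct=True, coherent=True),
    StepJudgement(correct=False, coherent=True),
    StepJudgement(correct=True, coherent=False),
    StepJudgement(correct=True, coherent=True),
]
print(final_answer_accuracy("42", "42"), step_level_score(chain))
```

In this toy example the final answer is counted as fully correct, while only half of the intermediate steps pass both checks, which is the kind of gap a step-level metric is meant to expose.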
