Abstract
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.