OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Abstract

Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through reinforcement learning (RL) with verifiable rewards, which significantly improves model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and RL to further improve model generalization. Initially, reasoning capabilities are distilled from pure-text R1 models by generating reasoning steps from high-quality captions of images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yields OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.
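
The alternation between SFT and RL described in the abstract can be pictured as a simple loop. The following is a minimal sketch of the control flow only, not the authors' implementation: the function name and the `sft`, `rl`, and `generate_sft_data` callables are hypothetical placeholders for the actual training and data-generation steps, and continuing each round from the current model (rather than restarting from the base model) is an assumption the abstract does not settle.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Example = TypeVar("Example")


def iterative_self_improvement(
    base_model: Model,
    seed_sft_data: List[Example],
    sft: Callable[[Model, List[Example]], Model],
    rl: Callable[[Model], Model],
    generate_sft_data: Callable[[Model], List[Example]],
    num_iterations: int,
) -> Model:
    """Alternate SFT and RL; each RL-improved model produces the next SFT set.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    model = base_model
    # Round 1 uses seed data (in the abstract: reasoning distilled from a
    # text-only R1 model given image captions).
    sft_data = seed_sft_data
    for i in range(num_iterations):
        # Assumption: SFT continues from the current model each round.
        model = sft(model, sft_data)
        model = rl(model)
        # The RL-improved model generates refined SFT data for the next round.
        if i < num_iterations - 1:
            sft_data = generate_sft_data(model)
    return model
```

Passing the training steps in as callables keeps the sketch independent of any particular training framework; only the ordering of the stages is taken from the abstract.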
