Abstract
Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM -- by prompting the model to produce a reasoning chain before providing an answer -- can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.