VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

Abstract

Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario, and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
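
To make the combination of GRPO and on-policy distillation more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the exact objective, so the additive form of the loss, the weight `beta`, the use of a student-to-teacher KL term, and all function names are assumptions made for illustration. The sketch assumes that rewards, sequence log-probabilities, and teacher log-probabilities on the student's own samples are already available.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_reverse_kl(student_logprobs, teacher_logprobs):
    """KL(student || teacher) at one token position over a shared vocabulary.

    Both arguments are lists of log-probabilities. Because the tokens come
    from the student's own rollouts, this is an on-policy distillation signal.
    """
    return sum(
        math.exp(s) * (s - t)
        for s, t in zip(student_logprobs, teacher_logprobs)
    )


def combined_loss(seq_logprobs, advantages, kl_per_sequence, beta=0.1):
    """Hypothetical objective: policy-gradient term plus weighted distillation.

    The additive form and the weight `beta` are assumptions, not taken
    from the paper.
    """
    n = len(seq_logprobs)
    policy_term = -sum(a * lp for a, lp in zip(advantages, seq_logprobs)) / n
    distill_term = sum(kl_per_sequence) / n
    return policy_term + beta * distill_term
```

In this sketch the reward-driven term pushes the student toward rollouts that score above the group average, while the KL term pulls its token distribution toward the teacher's on the same rollouts. If teacher and student distributions are far apart, the KL term can dominate or become uninformative, which is consistent with the abstract's point that a cold-start alignment is needed first.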
