Abstract
Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pre-text tasks that unlock generalizable multimodal reasoning abilities in MLLMs.