Abstract
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.