Magma: A Foundation Model for Multimodal AI Agents

Abstract

We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow it with these agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in a GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig. 1. In particular, Magma sets new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.
