Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning

Abstract

Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, Emma-X. Emma-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which can help mitigate hallucination in grounding subtask reasoning generation. Experimental results demonstrate that Emma-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.
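
To make the idea of gripper-state-based segmentation concrete, the following is a minimal illustrative sketch and not the Emma-X procedure itself. It assumes a trajectory is available as a per-step list of booleans (`gripper_open`, a hypothetical name) and cuts a new segment whenever the gripper state flips. The motion-trajectory cue mentioned in the abstract is deliberately left out, since the abstract does not specify how it is used.

```python
from typing import List, Tuple


def segment_by_gripper(gripper_open: List[bool]) -> List[Tuple[int, int]]:
    """Split step indices into (start, end) segments, end exclusive.

    Assumption for illustration only: a segment boundary is placed at
    every step where the gripper state differs from the previous step.
    """
    if not gripper_open:
        return []

    segments: List[Tuple[int, int]] = []
    start = 0
    for t in range(1, len(gripper_open)):
        # A flip between open and closed ends the current segment.
        if gripper_open[t] != gripper_open[t - 1]:
            segments.append((start, t))
            start = t
    # Close the final segment at the end of the trajectory.
    segments.append((start, len(gripper_open)))
    return segments


if __name__ == "__main__":
    # Hypothetical 6-step trajectory: open, open, closed x3, open.
    print(segment_by_gripper([True, True, False, False, False, True]))
```

Each resulting segment could then be annotated with its own subtask description, which is one plausible way that shorter, state-aligned segments might limit ungrounded subtask text.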
