Abstract
A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing robot Multimodal Large Language Models (MLLMs) can handle a range of basic tasks, they still face challenges in two areas: 1) inadequate reasoning ability to tackle complex tasks, and 2) high computational costs for MLLM fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic MLLM that leverages the Mamba model to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual data with language embedding through co-training, empowering our model with visual common sense and robot-related reasoning. To further equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time (20 minutes). In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 7 times faster than existing robot MLLMs. Our project web page: https://sites.google.com/view/robomamba-web
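
To illustrate what "linear inference complexity" means for a state space model, the following is a minimal sketch of a diagonal, time-invariant linear SSM run as a recurrence. It is not the Mamba architecture or RoboMamba's implementation: Mamba makes the parameters input-dependent (selective) and uses a hardware-aware scan, both omitted here. The function names `ssm_step` and `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def ssm_step(
    h: List[float], x: float, a: List[float], b: List[float], c: List[float]
) -> Tuple[List[float], float]:
    """One recurrent step of a diagonal linear SSM (illustrative assumption).

    State update: h'[i] = a[i] * h[i] + b[i] * x
    Readout:      y     = sum_i c[i] * h'[i]
    """
    h_next = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
    y = sum(c_i * h_i for c_i, h_i in zip(c, h_next))
    return h_next, y


def ssm_scan(
    xs: List[float], a: List[float], b: List[float], c: List[float]
) -> List[float]:
    """Process a sequence token by token with a fixed-size hidden state.

    Each step costs O(len(a)) regardless of how many tokens came before,
    so the total cost grows linearly with the sequence length.
    """
    h = [0.0] * len(a)  # the state size does not depend on sequence length
    ys = []
    for x in xs:
        h, y = ssm_step(h, x, a, b, c)
        ys.append(y)
    return ys
```

Because the hidden state has a fixed size, generating each new output needs only the previous state and the current input, in contrast to attention, where each new token attends over all earlier tokens.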