Abstract
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted, standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.