OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Abstract

Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
