Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained VLM using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, vision-driven cooperation in embodied AI systems.