Abstract
In da Vinci robotic surgery, surgeons' hands and eyes are fully engaged in the procedure, making it difficult to access and manipulate multimodal patient data without interruption. We propose a voice-directed Surgical Agent Orchestrator Platform (SAOP) built on a hierarchical multi-agent framework, consisting of an orchestration agent and three task-specific agents driven by Large Language Models (LLMs). These LLM-based agents autonomously plan, refine, validate, and reason to map voice commands into specific tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models on the surgical video. We also introduce a Multi-level Orchestration Evaluation Metric (MOEM) to comprehensively assess performance and robustness from command-level and category-level perspectives. The SAOP achieves high accuracy and success rates across 240 voice commands, while LLM-based agents improve robustness against speech recognition errors and diverse or ambiguous free-form commands, demonstrating strong potential to support minimally invasive da Vinci robotic surgery.