OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

Abstract

Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires the agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts than existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
