Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

  • 2024-07-15 18:54:37
  • Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

Abstract

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documents of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform in tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) and involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflow. Our code and data are available at https://spider2-v.github.io.

 
