TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes

Abstract

The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data analytics system. Specifically, we propose a novel architecture built upon the Model Context Protocol (MCP), an emerging paradigm that enables LLMs to collaborate with knowledgeable agents. First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes and develop an AI-agent-powered NL2Operator translator to bridge user intent and analytical execution. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities. This design enhances both accuracy and efficiency while supporting high scalability through modular deployment. Finally, we propose an updating mechanism that harnesses deep research and machine unlearning techniques to refresh the data lakes and LLM knowledge, with the goal of balancing data freshness and inference efficiency.
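
To make the execution idea concrete, the following is a minimal Python sketch of dispatching the operators of a query plan to modality-specific handlers. It is an illustration under stated assumptions only: the names `Operator`, `ModalityRouter`, and `make_stub_server` are hypothetical, plain Python callables stand in for MCP servers, and the sketch neither uses an MCP SDK nor reflects the system's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Operator:
    """One step of a hypothetical operator plan (names are illustrative)."""
    name: str       # e.g. "filter" or "extract"
    modality: str   # "structured", "semi-structured", or "unstructured"
    argument: str   # operator-specific payload, kept as text for simplicity


Handler = Callable[[Operator], str]


class ModalityRouter:
    """Dispatches each operator to the handler registered for its modality."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, modality: str, handler: Handler) -> None:
        self._handlers[modality] = handler

    def execute(self, plan: List[Operator]) -> List[str]:
        results = []
        for op in plan:
            handler = self._handlers.get(op.modality)
            if handler is None:
                raise KeyError(f"no server registered for modality {op.modality!r}")
            results.append(handler(op))
        return results


def make_stub_server(label: str) -> Handler:
    # Stand-in for a modality-specific MCP server; a real one would call a model.
    return lambda op: f"{label}:{op.name}({op.argument})"


router = ModalityRouter()
for modality in ("structured", "semi-structured", "unstructured"):
    router.register(modality, make_stub_server(modality))

plan = [
    Operator("filter", "structured", "year > 2020"),
    Operator("extract", "unstructured", "topic"),
]
print(router.execute(plan))
```

Because each modality is served by its own registered handler, adding a handler for a new modality in this sketch requires no change to the others, which mirrors the modular deployment argument made in the abstract.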
