GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems

Abstract

Multi-agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome-only metrics are insufficient for evaluating multi-agent performance and highlight the importance of process-level diagnostics in designing more interpretable and resource-efficient collaborative AI systems.
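
The abstract names the two metrics but does not give their formulas, so the following is only an illustrative sketch of the general idea, not the GEMMAS definitions. It assumes a diversity score computed as the mean pairwise cosine distance between message embeddings, and an unnecessary-path ratio computed as the fraction of source-to-sink paths in the interaction DAG that pass through an agent not marked as useful. The function names, the `useful` set, and the toy graph are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def diversity_score(embeddings):
    """ASSUMED stand-in for IDS: mean pairwise cosine distance of messages."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


def all_paths(edges, source, sink):
    """Enumerate every directed path from source to sink in a DAG."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    def walk(node, path):
        if node == sink:
            yield path
            return
        for nxt in children.get(node, []):
            yield from walk(nxt, path + [nxt])

    return list(walk(source, [source]))


def unnecessary_path_ratio(edges, source, sink, useful):
    """ASSUMED stand-in for UPR: share of paths touching a non-useful agent."""
    paths = all_paths(edges, source, sink)
    if not paths:
        return 0.0
    unnecessary = [p for p in paths if any(n not in useful for n in p)]
    return len(unnecessary) / len(paths)


# Toy interaction graph: a question fans out to two agents, then merges.
edges = [("q", "a"), ("q", "b"), ("a", "out"), ("b", "out")]
upr = unnecessary_path_ratio(edges, "q", "out", useful={"q", "a", "out"})
ids_orthogonal = diversity_score([[1.0, 0.0], [0.0, 1.0]])
ids_identical = diversity_score([[1.0, 0.0], [1.0, 0.0]])
```

In this toy graph one of the two paths runs through agent `b`, which is not marked useful, so half of the paths count as unnecessary. Two systems could reach the same final answer on such a graph while differing in both scores, which is the kind of gap the abstract attributes to outcome-only evaluation.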
