Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents

Abstract

Large Language Model (LLM) agent frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in the Shapley Value from cooperative game theory, which systematically measures the marginal impact of individual modules and their interactions within an agent's architecture. By replacing default modules with test variants across all possible combinations, CapaBench provides a principled method for attributing performance contributions. Key contributions include: (1) We are the first to propose a Shapley Value-based methodology for quantifying the contributions of capabilities in LLM agents; (2) Modules with high Shapley Values consistently lead to predictable performance gains when combined, enabling targeted optimization; and (3) We build a multi-round dataset of over 1,000 entries spanning diverse domains and practical task scenarios, enabling comprehensive evaluation of agent capabilities. CapaBench bridges the gap between component-level evaluation and holistic system assessment, providing actionable insights for optimizing modular LLM agents and advancing their deployment in complex, real-world scenarios.
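
To make the attribution idea concrete, the following is a minimal sketch of an exact Shapley Value computation over the four modules named above, where each coalition is the set of modules swapped from the default to the test variant. It is an illustration under stated assumptions, not CapaBench's implementation: the function names and the toy success rates are hypothetical, and in practice the value function would be the measured task performance of the agent under each of the 2^4 = 16 module combinations.

```python
from itertools import combinations
from math import factorial

# The four modules named in the abstract.
MODULES = ("planning", "reasoning", "action", "reflection")


def shapley_values(modules, value):
    """Exact Shapley Values for a set function `value`.

    `value(S)` returns the performance when the modules in the frozenset S
    use the test variant and all remaining modules use the default one.
    """
    n = len(modules)
    phi = {}
    for m in modules:
        others = [x for x in modules if x != m]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain m.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of swapping in m on top of s.
                total += weight * (value(s | {m}) - value(s))
        phi[m] = total
    return phi


def toy_success_rate(upgraded):
    """Hypothetical value function (made-up numbers, not from the paper)."""
    base = 0.30
    gains = {"planning": 0.20, "reasoning": 0.15,
             "action": 0.10, "reflection": 0.05}
    score = base + sum(gains[m] for m in upgraded)
    # Assumed interaction: planning and reasoning help more together.
    if {"planning", "reasoning"} <= upgraded:
        score += 0.10
    return score


phi = shapley_values(MODULES, toy_success_rate)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

In this toy setting the assumed planning-reasoning interaction bonus is split evenly between those two modules, and the four attributions sum to the gap between the all-test and all-default configurations, which is the efficiency property of the Shapley Value.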
