Abstract
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.