Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps

  • 2025-09-02 08:20:59
  • Kangyu Wang, Hongliang He, Lin Liu, Ruiqi Liang, Zhenzhong Lan, Jianguo Li

Abstract

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://www.tbox.cn/about/model-ranking.
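
To make the ranking idea in the abstract concrete, the following is a minimal sketch of fitting Bradley-Terry strengths from pairwise win counts, plus a simple way to restrict comparisons to models with similar ratings. It is an illustration under stated assumptions, not the platform's implementation: the function names (`fit_bradley_terry`, `win_probability`, `proximity_pairs`), the iterative update used for fitting, and the fixed rating-gap threshold `max_gap` are all hypothetical choices made for this example.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths with a standard iterative (MM) update.

    wins[i][j] is the number of times model i beat model j.
    Assumes every model has at least one win and one loss, so that
    all strengths stay finite and positive.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Rescale so the strengths average to 1 (the scale is arbitrary).
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    return p


def win_probability(p, i, j):
    """Bradley-Terry probability that model i beats model j."""
    return p[i] / (p[i] + p[j])


def proximity_pairs(p, max_gap):
    """Hypothetical proximity rule: keep only pairs whose log-strengths
    differ by at most max_gap, so battles are between similar models."""
    logs = [math.log(x) for x in p]
    n = len(p)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(logs[i] - logs[j]) <= max_gap
    ]


# Toy data: model 0 usually beats 1, and 1 usually beats 2.
toy_wins = [
    [0, 8, 9],
    [2, 0, 8],
    [1, 2, 0],
]
toy_strengths = fit_bradley_terry(toy_wins)
```

In this sketch a newly added model could first be compared against a few existing models to obtain an initial strength, after which `proximity_pairs` would limit its battles to neighbours in rating; how Inclusion Arena actually performs Placement Matches and Proximity Sampling is described in the full paper, not here.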

 
