GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

Abstract

The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.
