Abstract
The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.