Abstract
While various vertical-domain large language models (LLMs) have been developed, automatically evaluating their performance across different domains remains a significant challenge. Current benchmark-based evaluation methods exhibit rigid, aimless interactions and rely on pre-collected static datasets that are costly to build, inflexible across domains, and misaligned with practical user needs. To address this issue, we revisit the evaluation components and introduce two concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process, enabling deeper exploration and supporting both quantitative metrics and qualitative insights. These concepts capture the nuanced behaviors of LLMs through richer, multi-turn interactions. We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval-augmented generation and reinforcement learning. Experiments on tasks ranging from constructing vertical-domain evaluations to activating existing benchmarks demonstrate the effectiveness of TestAgent across various scenarios. We believe this work offers an interesting perspective on automatic evaluation for LLMs.