SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

Abstract

Large language models (LLMs) have shown the potential to be integrated into human daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multi-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones. At the same time, they can complement each other to predict actual user preferences. We also demonstrate that GPT-4 is a reliable judge to automatically evaluate human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com
