SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

  • 2023-07-27 18:24:09
  • Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, Zhenzhong Lan
  • 0

Abstract

Large language models (LLMs) have shown the potential to be integrated intohuman daily lives. Therefore, user preference is the most critical criterionfor assessing LLMs' performance in real-world scenarios. However, existingbenchmarks mainly focus on measuring models' accuracy using multi-choicequestions, which limits the understanding of their capabilities in realapplications. We fill this gap by proposing a comprehensive Chinese benchmarkSuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUEencompasses three sub-tasks: actual users' queries and ratings derived from anLLM battle platform (CArena), open-ended questions with single andmultiple-turn dialogues (OPEN), and closed-ended questions with the same stemsas open-ended single-turn ones (CLOSE). Our study shows that accuracy onclosed-ended questions is insufficient to reflect human preferences achieved onopen-ended ones. At the same time, they can complement each other to predictactual user preferences. We also demonstrate that GPT-4 is a reliable judge toautomatically evaluate human preferences on open-ended questions in a Chinesecontext. Our benchmark will be released at https://www.CLUEbenchmarks.com

 

Quick Read (beta)

loading the full paper ...