Abstract
Large language models (LLMs) have shown the potential to be integrated into human daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multi-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones. At the same time, they can complement each other to predict actual user preferences. We also demonstrate that GPT-4 is a reliable judge to automatically evaluate human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com