CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

  • 2024-01-02 16:20:40
  • Quan Tu, Shilong Fan, Zihang Tian, Rui Yan

Abstract

Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.

 
