Abstract
Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.