Abstract
Realizing general-purpose language intelligence has been a longstanding goal for natural language processing, where standard evaluation benchmarks play a fundamental and guiding role. We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic. To this end, we propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) Hierarchical benchmark framework, where datasets are principally selected and organized with a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy, where different levels of model performance are provided based on the hierarchical framework. To facilitate CUGE, we provide a public leaderboard that can be customized to support flexible model judging criteria. Evaluation results on representative pre-trained language models indicate ample room for improvement towards general-purpose language intelligence. CUGE is publicly available at cuge.baai.ac.cn.