CPG-EVAL: A Multi-Tiered Benchmark for Evaluating the Chinese Pedagogical Grammar Competence of Large Language Models

Abstract

Purpose: The rapid emergence of large language models (LLMs) such as ChatGPT has significantly impacted foreign language education, yet their pedagogical grammar competence remains under-assessed. This paper introduces CPG-EVAL, the first dedicated benchmark specifically designed to evaluate LLMs' knowledge of pedagogical grammar within the context of foreign language instruction. Methodology: The benchmark comprises five tasks designed to assess grammar recognition, fine-grained grammatical distinction, categorical discrimination, and resistance to linguistic interference. Findings: Smaller-scale models can succeed in single language instance tasks, but struggle with multiple instance tasks and interference from confusing instances. Larger-scale models show better resistance to interference but still have significant room for accuracy improvement. The evaluation indicates the need for better instructional alignment and more rigorous benchmarks, to effectively guide the deployment of LLMs in educational contexts. Value: This study offers the first specialized, theory-driven, multi-tiered benchmark framework for systematically evaluating LLMs' pedagogical grammar competence in Chinese language teaching contexts. CPG-EVAL not only provides empirical insights for educators, policymakers, and model developers to better gauge AI's current abilities in educational settings, but also lays the groundwork for future research on improving model alignment, enhancing educational suitability, and ensuring informed decision-making concerning LLM integration in foreign language instruction.
