ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

Abstract

Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
