Abstract
We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit, dangerous content, harassment, hate speech) in both user input and LLM-generated output. By evaluating on both public and internal benchmarks, we demonstrate superior performance compared to existing models, such as Llama Guard (+10.8\% AU-PRC on public benchmarks) and WildCard (+4.3\%). Additionally, we present a novel LLM-based data curation pipeline, adaptable to a variety of safety-related tasks and beyond. We show strong generalization performance for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling the creation of more effective content moderation solutions for developers.