ShieldGemma: Generative AI Content Moderation Based on Gemma

  • 2024-07-31 18:48:14
  • Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, Oscar Wahltinez

Abstract

We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit, dangerous content, harassment, hate speech) in both user input and LLM-generated output. By evaluating on both public and internal benchmarks, we demonstrate superior performance compared to existing models, such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). Additionally, we present a novel LLM-based data curation pipeline, adaptable to a variety of safety-related tasks and beyond. We show strong generalization performance for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling the creation of more effective content moderation solutions for developers.

 
