Watch Your Language: Investigating Content Moderation with Large Language Models

Abstract

Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has received recent enthusiasm; however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of commodity LLMs on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we instantiate 95 subcommunity-specific LLMs by prompting GPT-3.5 with rules from 95 Reddit subcommunities. We find that GPT-3.5 is effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we evaluate a suite of commodity LLMs (GPT-3, GPT-3.5, GPT-4, Gemini Pro, LLAMA 2) and show that LLMs significantly outperform currently widespread toxicity classifiers. However, recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on toxicity detection tasks. We conclude by outlining avenues for future work in studying LLMs and content moderation.
