RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Abstract

Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning "bad" words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et al., 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.
