Mitigating harm in language models with conditional-likelihood filtration

Abstract

Language models trained on large-scale unfiltered datasets curated from the open web acquire systemic biases, prejudices, and harmful views from their training data. We present a methodology for programmatically identifying and removing harmful text from web-scale datasets. A pretrained language model is used to calculate the log-likelihood of researcher-written trigger phrases conditioned on a specific document, which is used to identify and filter documents from the dataset. We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text, with a marginal decrease in performance on standard language modeling benchmarks compared to unfiltered baselines. We provide a partial explanation for this performance gap by surfacing examples of hate speech and other undesirable content from standard language modeling benchmarks. Finally, we discuss the generalization of this method and how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.
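
The filtering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold value, and the choice to take the maximum score across trigger phrases are all assumptions made for illustration. In particular, the scorer below is a toy smoothed unigram model standing in for the pretrained language model, so that the example runs without downloading any weights.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (document, trigger) -> log p(trigger | document).
Scorer = Callable[[str, str], float]


def toy_conditional_log_likelihood(
    document: str, trigger: str, alpha: float = 1.0, vocab_size: int = 1000
) -> float:
    """Toy stand-in for a pretrained LM (assumption, not the paper's model).

    Scores each trigger token with an add-alpha smoothed unigram distribution
    estimated from the document, then sums the token log-probabilities.
    """
    doc_tokens = document.lower().split()
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + alpha * vocab_size
    return sum(
        math.log((counts[tok] + alpha) / denom) for tok in trigger.lower().split()
    )


def filter_documents(
    documents: Sequence[str],
    triggers: Sequence[str],
    score: Scorer,
    threshold: float,
) -> List[str]:
    """Keep documents whose highest trigger log-likelihood is below threshold.

    Taking the max over triggers is an assumed aggregation rule.
    """
    if not triggers:
        raise ValueError("at least one trigger phrase is required")
    kept = []
    for doc in documents:
        highest = max(score(doc, trig) for trig in triggers)
        if highest < threshold:
            kept.append(doc)
    return kept


# Hypothetical data: the trigger phrase and threshold are illustrative only.
docs = [
    "the weather is nice today",
    "this is awful and this is truly awful stuff",
]
kept = filter_documents(
    docs, ["this is awful"], toy_conditional_log_likelihood, threshold=-19.0
)
print(kept)
```

In a realistic setting the toy scorer would be replaced by a function that concatenates the document and the trigger phrase, runs a pretrained causal language model, and sums the log-probabilities of the trigger tokens only. The threshold would then need to be chosen empirically, since the abstract does not specify one.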
