Sustainable Modular Debiasing of Language Models

Abstract

Unfair stereotypical biases (e.g., gender, racial, or religious biases) encoded in modern pretrained language models (PLMs) have negative ethical implications for the widespread adoption of state-of-the-art language technology. To remedy this, a wide range of debiasing techniques have recently been introduced to remove such stereotypical biases from PLMs. Existing debiasing methods, however, directly modify all of the PLMs' parameters, which, besides being computationally expensive, comes with the inherent risk of (catastrophic) forgetting of useful language knowledge acquired in pretraining. In this work, we propose a more sustainable modular debiasing approach based on dedicated debiasing adapters, dubbed ADELE. Concretely, we (1) inject adapter modules into the original PLM layers and (2) update only the adapters (i.e., we keep the original PLM parameters frozen) via language modeling training on a counterfactually augmented corpus. We showcase ADELE in gender debiasing of BERT: our extensive evaluation, encompassing three intrinsic and two extrinsic bias measures, shows that ADELE is very effective in bias mitigation. We further show that, due to its modular nature, ADELE, coupled with task adapters, retains fairness even after large-scale downstream training. Finally, by means of multilingual BERT, we successfully transfer ADELE to six target languages.
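The adapters are trained with language modeling on a counterfactually augmented corpus. Below is a minimal sketch of one common way to build such a corpus: two-sided counterfactual data augmentation, which swaps gendered terms and keeps both the original and the swapped sentence. The word-pair list, the function names `counterfactual` and `augment_corpus`, and the regex tokenization are hypothetical assumptions for illustration. They are not ADELE's actual preprocessing, whose word list and details are not given in this abstract.

```python
import re

# Hypothetical, deliberately tiny list of gendered term pairs (assumption);
# a real setup would use a much larger curated list.
GENDER_PAIRS = [
    ("he", "she"), ("man", "woman"), ("men", "women"),
    ("boy", "girl"), ("father", "mother"), ("son", "daughter"),
    ("brother", "sister"), ("king", "queen"),
]

# Bidirectional lookup: each term maps to its counterpart.
SWAP = {}
for a, b in GENDER_PAIRS:
    SWAP[a] = b
    SWAP[b] = a


def _swap_token(match):
    """Replace one word with its counterpart, preserving its casing."""
    word = match.group(0)
    repl = SWAP.get(word.lower())
    if repl is None:
        return word  # not a gendered term, leave unchanged
    if word.isupper() and len(word) > 1:
        return repl.upper()
    if word[0].isupper():
        return repl.capitalize()
    return repl


def counterfactual(sentence):
    """Return the sentence with every listed gendered term swapped."""
    return re.sub(r"\b[A-Za-z]+\b", _swap_token, sentence)


def augment_corpus(sentences):
    """Two-sided augmentation: keep each original sentence and, if it
    changes under swapping, also add its counterfactual version."""
    augmented = []
    for s in sentences:
        augmented.append(s)
        cf = counterfactual(s)
        if cf != s:
            augmented.append(cf)
    return augmented


if __name__ == "__main__":
    corpus = ["The sky is blue.", "My brother became a father."]
    for line in augment_corpus(corpus):
        print(line)
```

Ambiguous forms such as "her" (which can correspond to either "him" or "his") are left out on purpose, since a plain dictionary lookup cannot resolve them; handling them would require part-of-speech information.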
