Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation

Abstract

Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express higher harmful social biases and toxicity levels. Prior work has shown that finetuning on specialized datasets can mitigate this behavior, and doing so in English can transfer to other languages. In this work, we investigate the impact of different finetuning methods on the model's bias and toxicity, but also on its ability to produce fluent and diverse text. We reduce biases by finetuning on curated non-harmful text, but find only direct preference optimization to be effective for mitigating toxicity. The mitigation caused by applying these methods in English also transfers to non-English languages. We find evidence that the extent to which transfer takes place can be predicted by the amount of data in a given language present in the model's pretraining data. However, this transfer of bias and toxicity mitigation often comes at the expense of decreased language generation ability in non-English languages, highlighting the importance of developing language-specific bias and toxicity mitigation methods.
