Abstract
Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express higher harmful social biases and toxicity levels. Prior work has shown that finetuning on specialized datasets can mitigate this behavior, and doing so in English can transfer to other languages. In this work, we investigate the impact of different finetuning methods on the model's bias and toxicity, but also on its ability to produce fluent and diverse text. Our results show that finetuning on curated non-harmful text is more effective for mitigating bias, and finetuning on direct preference optimization (DPO) datasets is more effective for mitigating toxicity. The mitigation caused by applying these methods in English also transfers to non-English languages. We find evidence that the extent to which transfer takes place can be predicted by the amount of data in a given language present in the model's pretraining data. However, this transfer of bias and toxicity mitigation often comes at the expense of decreased language generation ability in non-English languages, highlighting the importance of developing language-specific bias and toxicity mitigation methods.