Safe and Responsible Large Language Model: Can We Balance Bias Reduction and Language Understanding in Large Language Models?

Abstract

Large Language Models (LLMs) have significantly advanced various NLP tasks. However, these models often risk generating unsafe text that perpetuates biases. Current approaches to produce unbiased outputs from LLMs can reduce biases but at the expense of knowledge retention. In this research, we address the question of whether producing safe (unbiased) outputs through LLMs can retain knowledge and language understanding. In response, we developed the Safety and Responsible Large Language Model (\textbf{SR}$_{\text{LLM}}$), an LLM that has been instruction fine-tuned on top of already safe LLMs (e.g., Llama2 or related) to diminish biases in generated text. To achieve our goals, we compiled a specialized dataset designed to train our model in identifying and correcting biased text. We conduct experiments, both on this custom data and out-of-distribution test sets, to show the bias reduction and knowledge retention. The results confirm that \textbf{SR}$_{\text{LLM}}$ outperforms traditional fine-tuning and prompting methods in both reducing biases and preserving the integrity of language knowledge. The significance of our findings lies in demonstrating that instruction fine-tuning can provide a more robust solution for bias reduction in LLMs. We have made our code and data available at \href{https://github.com/shainarazavi/Safe-Responsible-LLM}{Safe-LLM}.
