Abstract
Large Language Models (LLMs) have advanced various Natural Language Processing (NLP) tasks, such as text generation and translation. However, these models often generate text that can perpetuate biases. Existing approaches to mitigating these biases usually compromise knowledge retention. This study explores whether LLMs can produce safe, unbiased outputs without sacrificing knowledge or comprehension. We introduce the Safe and Responsible Large Language Model (\textbf{SR}$_{\text{LLM}}$), which is instruction fine-tuned on top of a safe fine-tuned, auto-regressive, decoder-only LLM to reduce biases in generated text. We developed a specialized dataset of unsafe examples paired with corresponding safe variations to train \textbf{SR}$_{\text{LLM}}$ to identify and correct biased text. Experiments on our specialized dataset and out-of-distribution test sets reveal that \textbf{SR}$_{\text{LLM}}$ effectively reduces biases while preserving knowledge integrity. This performance surpasses that of traditional fine-tuning of smaller language models and of base LLMs that merely rely on prompting techniques. Our findings demonstrate that instruction fine-tuning on custom datasets tailored for tasks such as debiasing is a highly effective strategy for minimizing bias in LLMs while preserving their inherent knowledge and capabilities. The code and dataset are accessible at \href{https://github.com/shainarazavi/Safe-Responsible-LLM}{SR-LLM}.