Abstract
Training large language models to follow instructions makes them perform better on a wide range of tasks, generally becoming more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not safety, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) in the training set when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find a behavior of exaggerated safety, where too much safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. Our study sheds light on trade-offs in training LLMs to follow instructions and exhibit safe behavior.