Safer-Instruct: Aligning Language Models with Automated Preference Data

Abstract

Reinforcement learning from human feedback (RLHF) is a vital strategy for enhancing model capability in language models. However, annotating preference data for RLHF is a resource-intensive and creativity-demanding process, while existing automatic generation methods face limitations in data diversity and quality. In response, we present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. To verify the effectiveness of Safer-Instruct, we apply the pipeline to construct a safety preference dataset as a case study. Finetuning an Alpaca model on this synthetic dataset not only demonstrates improved harmlessness but also outperforms models fine-tuned on human-annotated safety preference data, all the while maintaining a competitive edge in downstream tasks. Importantly, our Safer-Instruct framework is versatile and can be applied to generate preference data across various domains, extending its utility beyond safety preferences. It addresses the challenges in preference data acquisition and advances the development of more capable and responsible AI systems. For dataset and code implementation, see https://github.com/uscnlp-lime/safer-instruct
