Reducing the Scope of Language Models

Abstract

We now deploy language models in a wide variety of user-facing applications. Typically, these deployments have a specific purpose, such as answering questions about documentation or acting as coding assistants, but they require general language understanding. Under these circumstances, the models should not answer irrelevant requests such as poetry generation or questions about physics. Instead, we would like language models to answer only queries corresponding to the desired behavior and refuse all other requests, which we refer to as scoping. We conduct a comprehensive empirical evaluation of potential methods, from prompting to fine-tuning to preference learning to a recently proposed method for general alignment called Circuit Breakers (CB). Across three families of language models and a broad variety of tasks, we show that it is possible to scope language models. We examine scoping for multiple topics and for fine-grained topics. We ablate the diversity of irrelevant queries, layer different techniques, conduct adversarial evaluations, and more. Among other results, we find that when diverse examples of irrelevant queries are available, simple supervised fine-tuning produces the best results, but when such diversity is low, Circuit Breakers perform quite well. One can often get the benefits of both methods by layering them in succession. We intend our study to serve as a practitioner's guide to scoping language models.
