Who's Harry Potter? Approximate Unlearning in LLMs

Abstract

Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content. This poses legal and ethical challenges for the developers and users of these models, as well as for the original authors and publishers. In this paper, we propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch. We evaluate our technique on the task of unlearning the Harry Potter books from the Llama2-7b model (a generative language model recently open-sourced by Meta). While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU-hour of fine-tuning, we effectively erase the model's ability to generate or recall Harry Potter-related content, while its performance on common benchmarks (such as WinoGrande, HellaSwag, ARC, BoolQ and PIQA) remains almost unaffected. We make our fine-tuned model publicly available on HuggingFace for community evaluation. To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models.

Our technique consists of three main components. First, we use a reinforced model that is further trained on the target data to identify the tokens that are most related to the unlearning target, by comparing its logits with those of a baseline model. Second, we replace idiosyncratic expressions in the target data with generic counterparts, and leverage the model's own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data. Third, we fine-tune the model on these alternative labels, which effectively erases the original text from the model's memory whenever it is prompted with its context.
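The logit comparison in the first component, and the way it can feed the alternative labels of the second, can be illustrated with a toy sketch. The abstract does not state how the two sets of logits are combined, so the rule used here (subtract from the baseline logit whatever the reinforced model adds on top of it, scaled by a hypothetical coefficient `alpha`) is an assumption, as are the function names, the four-word vocabulary, and the hand-written logit values. It is meant only to show the idea, not to reproduce the actual procedure.

```python
def generic_logits(baseline, reinforced, alpha=1.0):
    """Down-weight tokens whose logit the reinforced model boosts.

    Assumed combination rule (not taken from the abstract):
        baseline - alpha * max(0, reinforced - baseline)
    Tokens the reinforced model does not boost are left unchanged.
    """
    if len(baseline) != len(reinforced):
        raise ValueError("logit vectors must have the same length")
    return [b - alpha * max(0.0, r - b) for b, r in zip(baseline, reinforced)]


def alternative_label(baseline, reinforced, alpha=1.0):
    """Return the index of the highest-scoring token after the adjustment."""
    adjusted = generic_logits(baseline, reinforced, alpha)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy vocabulary and made-up next-token logits for a single context.
# The reinforced model strongly prefers the target-specific token "Ron".
vocab = ["Ron", "Sam", "the", "a"]
baseline = [3.0, 2.5, 1.0, 0.5]
reinforced = [6.0, 2.4, 1.0, 0.5]

label = alternative_label(baseline, reinforced, alpha=1.0)
print(vocab[label])  # prints "Sam": the target-specific token is suppressed
```

With `alpha=0` the adjustment vanishes and the label falls back to the baseline model's own top prediction. In the third component, labels of this kind would serve as the fine-tuning targets in place of the original next tokens.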
