Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets

Abstract

Language models can generate harmful and biased outputs and exhibit undesirable behavior according to a given cultural context. We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets, an iterative process to significantly change model behavior by crafting and fine-tuning on a dataset that reflects a predetermined set of target values. We evaluate our process using three metrics: two quantitative metrics, namely human evaluations that score output adherence to a target value and toxicity scoring on outputs, and one qualitative metric analyzing the most common word associated with a given social category. Through each iteration, we add additional training dataset examples based on observed shortcomings from evaluations. PALMS performs significantly better on all metrics compared to baseline and control models for a broad range of GPT-3 language model sizes without compromising capability integrity. We find that the effectiveness of PALMS increases with model size. We show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.
