Abstract
Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activation as a linear combination of the benign and undesirable components. By removing the latter ones from the activation, we reorient the behavior of LLMs towards alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
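The decompose-and-remove step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not name a sparse coding solver, so orthogonal matching pursuit is used here as a stand-in, and the dictionary is a tiny orthonormal toy rather than a large-scale concept dictionary. The names `omp`, `remove_undesirable`, and `undesirable` are hypothetical and chosen only for illustration.

```python
import numpy as np

def omp(D, x, k):
    """Greedy sparse coding (orthogonal matching pursuit, a stand-in solver):
    find c with at most k nonzeros such that x is approximately D @ c."""
    residual = x.copy()
    support = []
    c = np.zeros(D.shape[1])
    for _ in range(k):
        # Score each atom by its correlation with the current residual.
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf  # never select the same atom twice
        support.append(int(np.argmax(scores)))
        # Refit the coefficients of all selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    c[support] = coef
    return c

def remove_undesirable(D, x, undesirable, k=3):
    """Subtract the contribution of atoms annotated as undesirable."""
    c = omp(D, x, k)
    mask = np.asarray(undesirable, dtype=bool)
    return x - D[:, mask] @ c[mask], c

# Toy setup (assumption): 4 orthonormal "concept" atoms in an 8-dim space.
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((8, 4)))

# An "activation" mixing atom 0 (treated as benign) and atom 3 (undesirable).
x = 2.0 * D[:, 0] + 1.5 * D[:, 3]
undesirable = [False, False, False, True]

x_clean, c = remove_undesirable(D, x, undesirable)
print(np.round(c, 3))
print(np.allclose(x_clean, 2.0 * D[:, 0]))
```

In this toy case the cleaned activation keeps only the component along the benign atom. With a realistic overcomplete dictionary the atoms are correlated, so the choice of sparse solver and sparsity level affects which concepts are attributed to the activation.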