The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model

Abstract

Large Language Models (LLMs) have an unrivaled and invaluable ability to "align" their output to a diverse range of human preferences, by mirroring them in the text they generate. The internal characteristics of such models, however, remain largely opaque. This work presents the Injectable Realignment Model (IRM) as a novel approach to language model interpretability and explainability. Inspired by earlier work on Neural Programming Interfaces, we construct and train a small network -- the IRM -- to induce emotion-based alignments within a 7B parameter LLM architecture. The IRM outputs are injected via layerwise addition at various points during the LLM's forward pass, thus modulating its behavior without changing the weights of the original model. This isolates the alignment behavior from the complex mechanisms of the transformer model. Analysis of the trained IRM's outputs reveals a curious pattern. Across more than 24 training runs and multiple alignment datasets, patterns of IRM activations align themselves in striations associated with a neuron's index within each transformer layer, rather than being associated with the layers themselves. Further, a single neuron index (1512) is strongly correlated with all tested alignments. This result, although initially counterintuitive, is directly attributable to design choices present within almost all commercially available transformer architectures, and highlights a potential weak point in Meta's pretrained Llama 2 models. It also demonstrates the value of the IRM architecture for language model analysis and interpretability. Our code and datasets are available at https://github.com/DRAGNLabs/injectable-alignment-model
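
The layerwise-addition idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy "layers", and the fixed injection vectors are all hypothetical stand-ins. In the paper the injections come from a trained IRM network and the host model is Llama 2. The sketch only shows the two steps the abstract mentions: adding a per-layer vector to a frozen model's hidden state, and then summarizing the injected values by neuron index rather than by layer.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_injection(
    layers: List[Layer], injections: List[Vector], hidden: Vector
) -> Vector:
    """Run frozen layers, adding one injection vector to each layer's output."""
    if len(layers) != len(injections):
        raise ValueError("need exactly one injection vector per layer")
    for layer, delta in zip(layers, injections):
        hidden = layer(hidden)  # frozen layer: its weights are never modified
        hidden = [h + d for h, d in zip(hidden, delta)]  # layerwise addition
    return hidden


def mean_abs_by_neuron_index(injections: List[Vector]) -> Vector:
    """Average |injection| at each neuron index across all layers."""
    n_layers = len(injections)
    width = len(injections[0])
    return [
        sum(abs(layer[i]) for layer in injections) / n_layers
        for i in range(width)
    ]


# Hypothetical toy setup: two "layers" of width 4 and hand-picked injections.
double: Layer = lambda v: [2.0 * x for x in v]
identity: Layer = lambda v: list(v)

toy_layers = [double, identity]
toy_injections = [[0.0, 0.5, 0.0, 0.0], [0.0, -1.5, 0.25, 0.0]]

output = forward_with_injection(toy_layers, toy_injections, [1.0, 1.0, 1.0, 1.0])
profile = mean_abs_by_neuron_index(toy_injections)
dominant_index = max(range(len(profile)), key=profile.__getitem__)
```

In this toy setup the injections concentrate at one neuron index in both layers, so the per-index profile peaks there. That loosely mirrors the index-wise striations the abstract reports, but the numbers here are invented for illustration only.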
