Understanding (Un)Reliability of Steering Vectors in Language Models

Abstract

Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective steering. Finally, we observe that datasets where positive and negative activations are better separated are more steerable. Our results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.
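
To make the quantities in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes the steering vector is built as the mean of per-pair activation differences (positive minus negative), a common construction that the abstract itself does not specify. The function names (`mean_difference_vector`, `mean_cosine_to_vector`, `steer`) and the toy two-dimensional activations are hypothetical and chosen only for illustration.

```python
import math

def mean_difference_vector(pos_acts, neg_acts):
    """Return (steering vector, per-pair differences).

    Assumption: the steering vector is the mean of (positive - negative)
    activation differences; the abstract does not confirm this choice.
    """
    diffs = [[p - n for p, n in zip(pos, neg)]
             for pos, neg in zip(pos_acts, neg_acts)]
    dim = len(diffs[0])
    steering = [sum(d[i] for d in diffs) / len(diffs) for i in range(dim)]
    return steering, diffs

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine_to_vector(diffs, steering):
    """Average cosine similarity of each activation difference to the steering vector."""
    return sum(cosine(d, steering) for d in diffs) / len(diffs)

def steer(activation, steering, scale=1.0):
    """Add the scaled steering vector to one activation (the 'bias' applied at inference)."""
    return [a + scale * s for a, s in zip(activation, steering)]

# Toy data (hypothetical): two training pairs with 2-dimensional activations.
neg = [[0.0, 0.0], [0.0, 0.0]]
coherent_pos = [[2.0, 0.0], [4.0, 0.0]]    # differences point the same way
incoherent_pos = [[1.0, 0.0], [0.0, 1.0]]  # differences are orthogonal

coherent_vec, coherent_diffs = mean_difference_vector(coherent_pos, neg)
incoherent_vec, incoherent_diffs = mean_difference_vector(incoherent_pos, neg)

coherent_score = mean_cosine_to_vector(coherent_diffs, coherent_vec)
incoherent_score = mean_cosine_to_vector(incoherent_diffs, incoherent_vec)
steered = steer([1.0, 1.0], incoherent_vec, scale=2.0)
```

In this toy setup the coherent dataset scores higher than the incoherent one, which mirrors the abstract's claim only qualitatively: when the activation differences disagree in direction, their mean is a poor summary of any single example.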
