Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs

Abstract

Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
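
To make the two named techniques concrete, the following is a minimal sketch of a difference-in-means direction and an activation addition, run on synthetic vectors rather than real model activations. Everything in it is an assumption for illustration: the function names (`diff_in_means_direction`, `add_activation`, `cosine`), the dimensionality, the planted axes, and the steering strength `alpha` are hypothetical and are not taken from the paper, and the numbers it produces say nothing about the paper's results.

```python
import numpy as np


def diff_in_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def add_activation(acts, direction, alpha):
    """Activation addition: shift every activation by alpha along direction."""
    return acts + alpha * direction


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic stand-in for hidden states (assumption: 64-dim, 200 samples per set).
rng = np.random.default_rng(0)
d = 64
axis_a = np.zeros(d)
axis_a[0] = 1.0  # planted axis for a hypothetical "agreement" behavior
axis_b = np.zeros(d)
axis_b[1] = 1.0  # planted axis for a hypothetical "praise" behavior

base = rng.normal(size=(200, d))
agree_acts = rng.normal(size=(200, d)) + 3.0 * axis_a
praise_acts = rng.normal(size=(200, d)) + 3.0 * axis_b

# One difference-in-means direction per behavior, each against the same baseline.
dir_agree = diff_in_means_direction(agree_acts, base)
dir_praise = diff_in_means_direction(praise_acts, base)

# Steer the baseline along the "agreement" direction only.
steered = add_activation(base, dir_agree, alpha=4.0)

# How far each sample moved along each direction.
shift_along_agree = (steered - base) @ dir_agree
shift_along_praise = (steered - base) @ dir_praise

print("cosine(agree, praise):", cosine(dir_agree, dir_praise))
print("mean shift along agree:", shift_along_agree.mean())
print("mean shift along praise:", shift_along_praise.mean())
```

In a real setting the arrays would be hidden states collected at a chosen layer, and the shift would be applied during the forward pass. In this toy setup the movement along the second direction is small only because the planted axes are orthogonal by construction.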
