Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

  • 2024-11-18 10:20:35
  • Daniel J. Lee, Stefan Heimersheim

Abstract

Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE's sparsity, with lower-L0 SAE feature directions exerting a greater influence. Additionally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs compared to traditional SAEs.
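The measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: a random linear readout stands in for the layers of GPT-2 downstream of the perturbed activation, and all names (`perturbation_kl`, `unembed`, the dimensions, the scales) are hypothetical choices made for illustration. It shows only the core quantity, the KL divergence between the next-token distribution before and after moving an activation a fixed distance along a direction.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def kl_divergence(logits_p, logits_q):
    """KL(p || q) between the distributions defined by two logit vectors."""
    logp = log_softmax(logits_p)
    logq = log_softmax(logits_q)
    return float(np.sum(np.exp(logp) * (logp - logq)))


def perturbation_kl(activation, direction, scale, unembed):
    """KL between the original and perturbed next-token distributions.

    Assumption: `unembed` is a toy linear map from activations to logits,
    standing in for the rest of the model after the perturbed layer.
    """
    unit = direction / np.linalg.norm(direction)  # perturb along a unit vector
    perturbed = activation + scale * unit
    return kl_divergence(activation @ unembed, perturbed @ unembed)


# Toy setup (hypothetical sizes, random weights).
rng = np.random.default_rng(0)
d_model, vocab = 16, 50
unembed = rng.normal(size=(d_model, vocab))
activation = rng.normal(size=d_model)
direction = rng.normal(size=d_model)

scales = (0.0, 0.5, 1.0, 2.0)
kls = [perturbation_kl(activation, direction, s, unembed) for s in scales]
for s, kl in zip(scales, kls):
    print(f"scale={s:.1f}  KL={kl:.4f}")
```

In an experiment of the kind the abstract describes, `direction` would be, for example, an SAE feature direction, an SAE reconstruction error, or a baseline direction, and the resulting KL curves over perturbation scale would be compared across those choices.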

 
