Eliciting Latent Knowledge from Quirky Language Models

Abstract

Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future research empirically investigating ELK methods.
