Language Models (Mostly) Know What They Know

  • 2022-07-11 23:59:39
  • Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, Jared Kaplan

Abstract

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
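
To make "well-calibrated" concrete: a model is calibrated when, among answers it assigns probability p of being correct, roughly a fraction p really are correct. One common summary of the mismatch is a binned expected calibration error (ECE). The sketch below is illustrative only. The function name, the ten-bin default, and the sample values are assumptions, and the abstract does not say whether the paper's measurement follows this exact procedure.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    probs  -- predicted probabilities that an answer is correct (e.g. P(True))
    labels -- whether each answer was in fact correct
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        return 0.0

    # Assign each prediction to an equal-width bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, y in bucket if y) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: four answers each given P(True) = 0.75, three of them correct.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])

# Hypothetical data: two answers given P(True) = 1.0, both wrong.
overconfident = expected_calibration_error([1.0, 1.0], [False, False])
```

In this toy data the first case has an ECE of zero because stated confidence matches observed accuracy, while the second reaches the maximum of one because the model is fully confident and always wrong.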

 
