Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data

Abstract

Large language models (LLMs) are trained on large corpora, yet it is unclear whether they can reason about the information present within their training data. We design experiments to study out-of-context abduction in LLMs, the ability to infer the most plausible explanations for observations using relevant facts present in training data. We train treatment LLMs on names and behavior descriptions of fictitious chatbots, but not on examples of dialogue with the chatbots. We find that OpenAI's GPT 4o LLM can correctly infer at least one chatbot's name after observing example responses characteristic of that chatbot. We also find that previously training GPT 4o on descriptions of a chatbot's behavior allows it to display behaviors more characteristic of the chatbot when iteratively trained to display such behaviors. Our results have implications for situational awareness in LLMs and, therefore, for AI safety.
