Evaluating Frontier Models for Stealth and Situational Awareness

Abstract

Recent work has demonstrated the plausibility of frontier AI models scheming: knowingly and covertly pursuing an objective misaligned with their developers' intentions. Such behavior could be very hard to detect and, if present in future advanced systems, could pose severe loss-of-control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming. First, we propose five evaluations of a model's ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment, and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
