AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Abstract

Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI's actual capability. These conflicting interests lead to the problem of sandbagging, which we define as "strategic underperformance on an evaluation". In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.
