An alignment safety case sketch based on debate

Abstract

If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions, making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case": an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
