Abstract
Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific and deception-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: "Mafia", "Debate", "Backdoor Code" and "Wargames". For each game, we find scaling laws that approximate how domain performance depends on general AI system capability (using Chatbot Arena Elo as a proxy for general capability). We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. In our numerical examples, the NSO success rate is below 52% when overseeing systems that are 400 Elo points stronger than the baseline overseer, and it declines further for overseeing even stronger systems.
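
As a rough illustration of the modeling idea, the sketch below maps a general Elo to a domain-specific Elo through a piecewise-linear function with two plateaus, then converts a pair of domain Elos into a win probability. It assumes the standard Elo win-probability formula (base 10, scale 400). The function names, the breakpoints, and all numeric parameters are hypothetical and chosen only for illustration; they are not fitted values from this work.

```python
def domain_elo(general_elo, g_low, g_high, elo_low, elo_high):
    """Piecewise-linear map from general Elo to domain Elo (hypothetical form).

    Below g_low the player sits on the "task incompetence" plateau (elo_low);
    above g_high it sits on the "task saturation" plateau (elo_high);
    in between, domain Elo grows linearly with general Elo.
    """
    if general_elo <= g_low:
        return elo_low
    if general_elo >= g_high:
        return elo_high
    frac = (general_elo - g_low) / (g_high - g_low)
    return elo_low + frac * (elo_high - elo_low)


def win_probability(elo_a, elo_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


# Illustrative setup: the overseen system is 400 general Elo points
# stronger than the overseer. All parameters below are made up.
overseer_general = 1200.0
overseen_general = overseer_general + 400.0

oversight_elo = domain_elo(overseer_general, g_low=1000.0, g_high=1400.0,
                           elo_low=800.0, elo_high=1600.0)
deception_elo = domain_elo(overseen_general, g_low=1100.0, g_high=1500.0,
                           elo_low=900.0, elo_high=1500.0)

p_oversight = win_probability(oversight_elo, deception_elo)
print(f"oversight Elo: {oversight_elo:.0f}, deception Elo: {deception_elo:.0f}")
print(f"probability that oversight succeeds: {p_oversight:.3f}")
```

With these made-up parameters the overseen system is already on its saturation plateau, so widening the general-capability gap further would not change the outcome of this single game; the slopes and plateau positions determine how a given general Elo gap translates into a domain Elo gap.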