Abstract
As large language models (LLMs) become more powerful and are deployed moreautonomously, it will be increasingly important to prevent them from causingharmful outcomes. Researchers have investigated a variety of safety techniquesfor this purpose, e.g. using models to review the outputs of other models, orred-teaming techniques to surface subtle failure modes. However, researchershave not evaluated whether such techniques still ensure safety if the model isitself intentionally trying to subvert them. In this paper, we develop andevaluate pipelines of safety techniques ("protocols") that are robust tointentional subversion. We investigate a scenario in which we want to solve a sequence of programmingproblems, using access to a powerful but untrusted model (in our case, GPT-4),access to a less powerful trusted model (in our case, GPT-3.5), and limitedaccess to high-quality trusted labor. We investigate protocols that aim tonever submit solutions containing backdoors, which we operationalize here aslogical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that theuntrusted model could use to subvert them. One protocol is what we call trustedediting. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 torate the suspiciousness of that code. If the code is below some suspiciousnessthreshold, it is submitted. Otherwise, GPT-3.5 edits the solution to removeparts that seem suspicious and then submits the edited code. Another protocolis untrusted monitoring. This protocol asks GPT-4 to write code, and then asksanother instance of GPT-4 whether the code is backdoored, using varioustechniques to prevent the GPT-4 instances from colluding. These protocolsimprove substantially on simple baselines.