AssistanceZero: Scalably Solving Assistance Games

Abstract

Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe their shared goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both solving intractable decision-making problems under uncertainty and accurately modeling human users' behavior. We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over $10^{400}$ possible goals. Our approach, AssistanceZero, extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. Our results suggest that assistance games are a tractable framework for training effective AI assistants in complex environments. Our code and models are available at https://github.com/cassidylaidlaw/minecraft-building-assistance-game.
