Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

  • 2024-10-09 03:34:27
  • Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni
Previous work has shown that training "helpful-only" LLMs with reinforcement learning on a curriculum of gameable environments can lead models to generalize to egregious specification gaming, such as editing their own reward function or modifying task checklists to appear more successful. We show that gpt-4o, gpt-4o-mini, o1-preview, and o1-mini - frontier models trained to be helpful, harmless, and honest - can engage in specification gaming without training on a curriculum of tasks, purely from in-context iterative reflection (which we call in-context reinforcement learning, "ICRL"). We also show that using ICRL to generate highly-rewarded outputs for expert iteration (compared to the standard expert iteration reinforcement learning algorithm) may increase gpt-4o-mini's propensity to learn specification-gaming policies, generalizing (in very rare cases) to the most egregious strategy where gpt-4o-mini edits its own reward function. Our results point toward the strong ability of in-context reflection to discover rare specification-gaming strategies that models might not exhibit zero-shot or with normal training, highlighting the need for caution when relying on alignment of LLMs in zero-shot settings.


