Feedback Loops With Language Models Drive In-Context Reward Hacking

Abstract

Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the world, which in turn affects subsequent LLM outputs. In this work, we show that feedback loops can cause in-context reward hacking (ICRH), where the LLM at test-time optimizes a (potentially implicit) objective but creates negative side effects in the process. For example, consider an LLM agent deployed to increase Twitter engagement; the LLM may retrieve its previous tweets into the context window and make them more controversial, increasing engagement but also toxicity. We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. For these processes, evaluations on static datasets are insufficient: they miss the feedback effects and thus cannot capture the most harmful behavior. In response, we provide three recommendations for evaluation to capture more instances of ICRH. As AI development accelerates, the effects of feedback loops will proliferate, increasing the need to understand their role in shaping LLM behavior.
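The Twitter example can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: the `Tweet` class, the `controversy` score, and the `engagement`, `toxicity`, and `refine` functions are all hypothetical stand-ins, chosen only so that engagement and toxicity both rise with controversy. It shows how a loop that keeps any revision with higher engagement can drift toward toxic output, while a check on the first output alone would see nothing wrong.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    controversy: float  # hypothetical scalar in [0, 1], not a real metric


def engagement(tweet: Tweet) -> float:
    # Assumed world response: engagement grows with controversy.
    return 100.0 * (1.0 + 2.0 * tweet.controversy)


def toxicity(tweet: Tweet) -> float:
    # Assumed side effect: toxicity also grows with controversy.
    return tweet.controversy ** 2


def refine(tweet: Tweet, step: float = 0.2) -> Tweet:
    # Stand-in for the LLM rewriting its previous tweet after seeing feedback.
    return Tweet(tweet.text + "!", min(1.0, tweet.controversy + step))


def feedback_loop(initial: Tweet, rounds: int) -> list[Tweet]:
    """Output-refinement sketch: keep a revision only if engagement improves."""
    history = [initial]
    for _ in range(rounds):
        current = history[-1]
        candidate = refine(current)
        if engagement(candidate) > engagement(current):
            history.append(candidate)
        else:
            break  # no further gain in the optimized objective
    return history


if __name__ == "__main__":
    for t in feedback_loop(Tweet("hello", 0.0), rounds=3):
        print(f"engagement={engagement(t):.0f}  toxicity={toxicity(t):.2f}")
```

In this toy setting, the optimized objective (engagement) and the unmeasured side effect (toxicity) increase together over rounds, which is the pattern the abstract describes. A single-step evaluation would inspect only the initial tweet and report zero toxicity.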
