SMiRL: Surprise Minimizing Reinforcement Learning in Dynamic Environments

Abstract

All living organisms struggle against the forces of nature to carve out a maintainable niche. We propose that such a search for order amidst chaos might offer a unifying principle for the emergence of useful behaviors in artificial agents. We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing reinforcement learning (SMiRL). SMiRL alternates between learning a density model to evaluate the surprise of a stimulus, and improving the policy to seek more predictable stimuli. This process maximizes a lower-bound on the negative entropy of the states, which can be seen as maximizing the agent's ability to maintain order in the environment. The policy seeks out stable and repeatable situations that counteract the environment's prevailing sources of entropy. This might include avoiding other hostile agents, or finding a stable, balanced pose for a bipedal robot in the face of disturbance forces. We demonstrate that our surprise minimizing agents can successfully play Tetris, Doom, control a humanoid to avoid falls, and navigate to escape enemies in a maze without any task-specific reward supervision. We further show that SMiRL can be used together with standard task rewards to accelerate reward-driven learning.
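
The alternation described above (fit a density model to the states seen so far, then reward the policy for reaching states that model finds likely) can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the density model is an independent per-dimension Gaussian, the reward is the log-likelihood of the state under that model, and a variance floor keeps the log-density finite. The names `RunningGaussian` and `surprise_reward` are hypothetical, and the policy-improvement step is left out; any standard RL algorithm could consume the reward.

```python
import math


class RunningGaussian:
    """Per-dimension Gaussian fit to visited states (an assumed model choice)."""

    def __init__(self, dim, min_var=1e-2):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim    # running sum of squared deviations (Welford)
        self.min_var = min_var   # variance floor, an assumption of this sketch

    def update(self, state):
        """Fold one observed state into the running mean and variance."""
        self.n += 1
        for i, x in enumerate(state):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def log_prob(self, state):
        """Log-density of a state; a unit-variance prior is used before any data."""
        total = 0.0
        for i, x in enumerate(state):
            var = max(self.m2[i] / self.n, self.min_var) if self.n > 0 else 1.0
            total += -0.5 * (math.log(2 * math.pi * var)
                             + (x - self.mean[i]) ** 2 / var)
        return total


def surprise_reward(model, state):
    """Score the state under the current model, then update the model with it."""
    reward = model.log_prob(state)  # high when the state is unsurprising
    model.update(state)
    return reward
```

In this sketch, a state close to those already visited earns a higher reward than an unfamiliar one, so a policy trained on `surprise_reward` is pushed toward stable, repeatable situations.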
