A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning

Abstract

While reinforcement learning (RL) provides a framework for learning through trial and error, translating RL algorithms into the real world has remained challenging. A major hurdle to real-world application arises from the development of algorithms in an episodic setting where the environment is reset after every trial, in contrast with the continual and non-episodic nature of the real world encountered by embodied agents such as humans and robots. Prior works have considered an alternating approach where a forward policy learns to solve the task and the backward policy learns to reset the environment, but what initial state distribution should the backward policy reset the agent to? Assuming access to a few demonstrations, we propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations. This keeps the agent close to the task-relevant states, allowing for a mix of easy and difficult starting states for the forward policy. Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks from the EARL benchmark, with 40% gains on the hardest task, while making fewer assumptions than prior works.
