TOM: Learning Policy-Aware Models for Model-Based Reinforcement Learning via Transition Occupancy Matching

Abstract

Standard model-based reinforcement learning (MBRL) approaches fit a transition model of the environment to all past experience, but this wastes model capacity on data that is irrelevant for policy improvement. We instead propose a new "transition occupancy matching" (TOM) objective for MBRL model learning: a model is good to the extent that the current policy experiences the same distribution of transitions inside the model as in the real environment. We derive TOM directly from a novel lower bound on the standard reinforcement learning objective. To optimize TOM, we show how to reduce it to a form of importance weighted maximum-likelihood estimation, where the automatically computed importance weights identify policy-relevant past experiences from a replay buffer, enabling stable optimization. TOM thus offers a plug-and-play model learning sub-routine that is compatible with any backbone MBRL algorithm. On various MuJoCo continuous robotic control tasks, we show that TOM successfully focuses model learning on policy-relevant experience and drives policies faster to higher task rewards than alternative model learning approaches.
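
To make the phrase "importance weighted maximum-likelihood estimation" concrete, the following is a minimal sketch of the general idea, not the TOM algorithm itself. It assumes a toy one-dimensional model s' ~ N(theta * s, sigma^2) and a tiny hand-made replay buffer. The weights are hypothetical placeholders chosen by hand; the abstract states that TOM computes its weights automatically, and that procedure is not reproduced here. The names `weighted_nll` and `weighted_mle` are illustrative only.

```python
import math

def weighted_nll(theta, transitions, weights, sigma=1.0):
    """Importance-weighted negative log-likelihood of s' ~ N(theta * s, sigma^2)."""
    total = 0.0
    for (s, s_next), w in zip(transitions, weights):
        resid = s_next - theta * s
        log_prob = -0.5 * (resid / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
        total += -w * log_prob
    return total / sum(weights)

def weighted_mle(transitions, weights):
    """Closed-form minimizer of weighted_nll over theta (weighted least squares)."""
    num = sum(w * s * s_next for (s, s_next), w in zip(transitions, weights))
    den = sum(w * s * s for (s, _), w in zip(transitions, weights))
    return num / den

# Toy replay buffer of (s, s') pairs: the first two follow s' = 2s, the last two s' = -s.
buffer = [(1.0, 2.0), (2.0, 4.0), (1.0, -1.0), (2.0, -2.0)]

# Uniform weights correspond to ordinary maximum likelihood on all past experience.
uniform = [1.0, 1.0, 1.0, 1.0]

# Hypothetical weights (an assumption for illustration) that mark only the
# first two transitions as relevant to the current policy.
focused = [1.0, 1.0, 0.0, 0.0]

theta_uniform = weighted_mle(buffer, uniform)
theta_focused = weighted_mle(buffer, focused)
print(theta_uniform, theta_focused)
```

With uniform weights the single parameter compromises between the two regimes in the buffer, while the focused weights let the model fit the up-weighted transitions exactly. This mirrors, in a very reduced form, the abstract's point that fitting all past experience can spend model capacity on data the current policy never encounters.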
