Abstract
Understanding generalization in reinforcement learning (RL) is a significant challenge, as many common assumptions of traditional supervised learning theory do not apply. We focus on the special class of reparameterizable RL problems, where the trajectory distribution can be decomposed using the reparameterization trick. For this problem class, estimating the expected return is efficient and the trajectory can be computed deterministically given peripheral random variables, which enables us to study reparameterizable RL using supervised learning and transfer learning theory. Through these relationships, we derive guarantees on the gap between the expected and empirical return for both intrinsic and external errors, based on Rademacher complexity as well as the PAC-Bayes bound. Our bound suggests the generalization capability of reparameterizable RL is related to multiple factors including "smoothness" of the environment transition, reward and agent policy function class. We also empirically verify the relationship between the generalization gap and these factors through simulations.
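
To make the reparameterization idea concrete, the following is a minimal sketch, not the setup used in our experiments. It assumes a hypothetical one-dimensional environment with a linear policy, a linear transition with additive Gaussian noise, and a quadratic reward; all function names and constants are illustrative. Once the peripheral noise is sampled up front, the trajectory and its return are deterministic functions of the policy parameter, so the gap between a small-sample empirical return and a large-sample estimate of the expected return can be measured as in supervised learning.

```python
import random


def sample_noise(rng, horizon):
    """Sample the peripheral random variables for one trajectory up front."""
    return [rng.gauss(0.0, 1.0) for _ in range(horizon)]


def rollout_return(theta, noise, s0=0.0, gamma=0.9):
    """Discounted return of one trajectory, deterministic given the noise."""
    s, total = s0, 0.0
    for t, xi in enumerate(noise):
        a = -theta * s                       # hypothetical linear policy
        s = 0.5 * s + a + xi                 # hypothetical transition, deterministic in (s, a, xi)
        total += (gamma ** t) * (-(s ** 2))  # hypothetical reward on the next state
    return total


def empirical_return(theta, noises):
    """Average return over a fixed set of pre-sampled noise sequences."""
    return sum(rollout_return(theta, n) for n in noises) / len(noises)


rng = random.Random(0)
train_noise = [sample_noise(rng, 10) for _ in range(20)]    # small training sample
test_noise = [sample_noise(rng, 10) for _ in range(2000)]   # proxy for the expectation

theta = 0.3  # illustrative policy parameter
gap = abs(empirical_return(theta, train_noise) - empirical_return(theta, test_noise))
print("estimated generalization gap:", gap)
```

Because the noise is fixed before the rollout, re-evaluating the same policy on the same noise sequence returns the same value, which is the property that lets the training noise play the role of a supervised training set.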