Abstract
In this work, we study offline reinforcement learning (RL) with the zero-shot generalization (ZSG) property, where the agent has access to an offline dataset containing experiences from different environments, and the goal is to train a policy on the training environments that performs well on test environments without further interaction. Existing work has shown that classical offline RL fails to generalize to new, unseen environments. We propose pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO), which leverage pessimistic policy evaluation to guide policy learning and enhance generalization. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG. Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.