Offline Trajectory Generalization for Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL) aims to learn policies from static datasets of previously collected trajectories. Existing offline RL methods either constrain the learned policy to the support of the offline data or use model-based virtual environments to generate simulated rollouts. However, these methods suffer from (i) poor generalization to unseen states and (ii) only marginal improvement from low-quality simulated rollouts. In this paper, we propose offline trajectory generalization through world transformers for offline reinforcement learning (OTTO). Specifically, we use causal Transformers, which we call World Transformers, to predict state dynamics and the immediate reward. We then propose four strategies that use World Transformers to generate high-reward simulated trajectories by perturbing the offline data. Finally, we train an offline RL algorithm jointly on the offline data and the simulated data. OTTO serves as a plug-in module and can be integrated with existing offline RL methods, enhancing them with the generalization capability of Transformers and with high-reward data augmentation. Extensive experiments on D4RL benchmark datasets show that OTTO significantly outperforms state-of-the-art offline RL methods.
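
The general idea of perturbing offline data and re-simulating it with a learned model can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_world_model` is a hypothetical linear stand-in for a World Transformer, Gaussian action noise is just one possible perturbation and is not claimed to be one of the paper's four strategies, and the keep-if-better filter is a simplification.

```python
import numpy as np

def toy_world_model(state, action):
    """Hypothetical stand-in for a learned world model.

    Returns (next_state, reward). A real World Transformer would
    condition on a history of states and actions instead.
    """
    next_state = 0.9 * state + 0.1 * action
    reward = -float(np.sum(next_state ** 2))
    return next_state, reward

def simulate(start_state, actions, model):
    """Roll a sequence of actions through the model from start_state."""
    state = start_state
    states, rewards = [state], []
    for action in actions:
        state, reward = model(state, action)
        states.append(state)
        rewards.append(reward)
    return np.array(states), np.array(rewards)

def augment(trajectories, model, noise_scale=0.1, seed=0):
    """Perturb each offline action sequence with Gaussian noise and keep
    the simulated rollout only if its simulated return exceeds that of
    the unperturbed actions under the same model."""
    rng = np.random.default_rng(seed)
    kept = []
    for start_state, actions in trajectories:
        _, base_rewards = simulate(start_state, actions, model)
        noisy = actions + noise_scale * rng.standard_normal(actions.shape)
        states, rewards = simulate(start_state, noisy, model)
        if rewards.sum() > base_rewards.sum():
            kept.append((states, noisy, rewards))
    return kept

# Usage: eight toy offline trajectories, each a start state plus 5 actions.
rng = np.random.default_rng(1)
offline = [(rng.standard_normal(3), rng.standard_normal((5, 3))) for _ in range(8)]
simulated = augment(offline, toy_world_model)
print(f"kept {len(simulated)} of {len(offline)} perturbed rollouts")
```

In a full pipeline, the kept rollouts would be mixed with the original offline data before being passed to an existing offline RL algorithm, which is the plug-in role the abstract describes.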
