Offline Reinforcement Learning as One Big Sequence Modeling Problem

Abstract

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
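
To make the idea of beam search as a planner concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes trajectories have already been discretized into integer tokens and that a trained sequence model exposes per-token scores through a hypothetical `score_fn`; here a toy stand-in, `toy_log_probs`, plays that role. Scoring by log-probability, as done below, is one possible choice and is an assumption of this sketch; a reward-based score could be substituted through the same interface.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: maps the tokens seen so far to one score per
# candidate next token (here, log-probabilities).
ScoreFn = Callable[[Sequence[int]], List[float]]


def beam_search_plan(
    score_fn: ScoreFn,
    prefix: Sequence[int],
    horizon: int,
    beam_width: int,
) -> Tuple[float, List[int]]:
    """Extend `prefix` by `horizon` tokens, keeping the `beam_width`
    highest-scoring partial sequences at every step."""
    beams: List[Tuple[float, List[int]]] = [(0.0, list(prefix))]
    for _ in range(horizon):
        candidates: List[Tuple[float, List[int]]] = []
        for score, seq in beams:
            # Expand every beam by every possible next token.
            for token, token_score in enumerate(score_fn(seq)):
                candidates.append((score + token_score, seq + [token]))
        # Keep only the best `beam_width` candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_log_probs(seq: Sequence[int]) -> List[float]:
    """Toy stand-in for a trained model (assumption for illustration):
    over a 3-token vocabulary, it prefers the cyclic successor of the
    last token."""
    vocab_size = 3
    preferred = (seq[-1] + 1) % vocab_size if seq else 0
    probs = [0.8 if t == preferred else 0.1 for t in range(vocab_size)]
    return [math.log(p) for p in probs]


best_score, best_seq = beam_search_plan(
    toy_log_probs, prefix=[0], horizon=3, beam_width=2
)
print(best_seq, best_score)
```

In this toy setting the search recovers the sequence the stand-in model prefers at every step. With a real trajectory model, the prefix would hold the tokens of the current state and the returned continuation would be decoded back into actions.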
