Sequence Modeling is a Robust Contender for Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL) allows agents to learn effective, return-maximizing policies from a static dataset. Three major paradigms for offline RL are Q-Learning, Imitation Learning, and Sequence Modeling. A key open question is: which paradigm is preferred under what conditions? We study this question empirically by exploring the performance of representative algorithms -- Conservative Q-Learning (CQL), Behavior Cloning (BC), and Decision Transformer (DT) -- across the commonly used D4RL and Robomimic benchmarks. We design targeted experiments to understand their behavior concerning data suboptimality and task complexity. Our key findings are: (1) Sequence Modeling requires more data than Q-Learning to learn competitive policies but is more robust; (2) Sequence Modeling is a substantially better choice than both Q-Learning and Imitation Learning in sparse-reward and low-quality data settings; and (3) Sequence Modeling and Imitation Learning are preferable as task horizon increases, or when data is obtained from suboptimal human demonstrators. Based on the overall strength of Sequence Modeling, we also investigate architectural choices and scaling trends for DT on Atari and D4RL and make design recommendations. We find that scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.
