Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

Abstract

We consider the offline reinforcement learning problem, where the aim is tolearn a decision making policy from logged data. Offline RL -- particularlywhen coupled with (value) function approximation to allow for generalization inlarge or continuous state spaces -- is becoming increasingly relevant inpractice, because it avoids costly and time-consuming online data collectionand is well suited to safety-critical domains. Existing sample complexityguarantees for offline value function approximation methods typically requireboth (1) distributional assumptions (i.e., good coverage) and (2)representational assumptions (i.e., ability to represent some or all $Q$-valuefunctions) stronger than what is required for supervised learning. However, thenecessity of these conditions and the fundamental limits of offline RL are notwell understood in spite of decades of research. This led Chen and Jiang (2019)to conjecture that concentrability (the most standard notion of coverage) andrealizability (the weakest representation condition) alone are not sufficientfor sample-efficient offline RL. We resolve this conjecture in the positive byproving that in general, even if both concentrability and realizability aresatisfied, any algorithm requires sample complexity polynomial in the size ofthe state space to learn a non-trivial policy. Our results show that sample-efficient offline reinforcement learningrequires either restrictive coverage conditions or representation conditionsthat go beyond supervised learning, and highlight a phenomenon calledover-coverage which serves as a fundamental barrier for offline value functionapproximation methods. A consequence of our results for reinforcement learningwith linear function approximation is that the separation between online andoffline RL can be arbitrarily large, even in constant dimension.

Quick Read (beta)

loading the full paper ...