On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation

Abstract

Sample-efficient offline reinforcement learning (RL) with linear functionapproximation has recently been studied extensively. Much of prior work hasyielded the minimax-optimal bound of $\tilde{\mathcal{O}}(\frac{1}{\sqrt{K}})$,with $K$ being the number of episodes in the offline data. In this work, weseek to understand instance-dependent bounds for offline RL with functionapproximation. We present an algorithm called Bootstrapped and ConstrainedPessimistic Value Iteration (BCP-VI), which leverages data bootstrapping andconstrained optimization on top of pessimism. We show that under a partial datacoverage assumption, that of \emph{concentrability} with respect to an optimalpolicy, the proposed algorithm yields a fast rate of$\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gapin the optimal Q-value functions, even when the offline data were adaptivelycollected. Moreover, when the linear features of the optimal actions in thestates reachable by an optimal policy span those reachable by the behaviorpolicy and the optimal actions are unique, offline RL achieves absolute zerosub-optimality error when $K$ exceeds a (finite) instance-dependent threshold.To the best of our knowledge, these are the first$\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality boundrespectively for offline RL with linear function approximation from adaptivedata with partial coverage. We also provide instance-agnostic andinstance-dependent information-theoretical lower bounds to complement our upperbounds.

Quick Read (beta)

loading the full paper ...