Achieving the Minimax Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach

Abstract

Offline reinforcement learning aims to learn from pre-collected datasetswithout active exploration. This problem faces significant challenges,including limited data availability and distributional shifts. Existingapproaches adopt a pessimistic stance towards uncertainty by penalizing rewardsof under-explored state-action pairs to estimate value functionsconservatively. In this paper, we show that the distributionally robustoptimization (DRO) based approach can also address these challenges and isminimax optimal. Specifically, we directly model the uncertainty in thetransition kernel and construct an uncertainty set of statistically plausibletransition kernels. We then find the policy that optimizes the worst-caseperformance over this uncertainty set. We first design a metric-basedHoeffding-style uncertainty set such that with high probability the truetransition kernel is in this set. We prove that to achieve a sub-optimality gapof $\epsilon$, the sample complexity is$\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is thediscount factor, $S$ is the number of states, and $C^{\pi^*}$ is thesingle-policy clipped concentrability coefficient which quantifies thedistribution shift. To achieve the optimal sample complexity, we furtherpropose a less conservative Bernstein-style uncertainty set, which, however,does not necessarily include the true transition kernel. We show that animproved sample complexity of$\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, whichmatches with the minimax lower bound for offline reinforcement learning, andthus is minimax optimal.

Quick Read (beta)

loading the full paper ...