### Abstract

Offline reinforcement learning aims to find the optimal policy from apre-collected dataset without active exploration. This problem is faced withmajor challenges, such as a limited amount of data and distribution shift.Existing studies employ the principle of pessimism in face of uncertainty, andpenalize rewards for less visited state-action pairs. In this paper, wedirectly model the uncertainty in the transition kernel using an uncertaintyset, and then employ the approach of distributionally robust optimization thatoptimizes the worst-case performance over the uncertainty set. We first designa Hoeffding-style uncertainty set, which guarantees that the true transitionkernel lies in the uncertainty set with high probability. We theoreticallyprove that it achieves an $\epsilon$-accuracy with a sample complexity of$\mathcal{O}\left((1-\gamma)^{-4}\epsilon^{-2}SC^{\pi^*} \right)$, where$\gamma$ is the discount factor, $C^{\pi^*}$ is the single-policyconcentrability for any comparator policy $\pi^*$, and $S$ is the number ofstates. We further design a Bernstein-style uncertainty set, which does notnecessarily guarantee the true transition kernel lies in the uncertainty set.We show an improved and near-optimal sample complexity of$\mathcal{O}\left((1-\gamma)^{-3}\epsilon^{-2}\left(SC^{\pi^*}+(\mu_{\min})^{-1}\right)\right)$, where $\mu_{\min}$ denotes the minimal non-zero entry of the behaviordistribution. In addition, the computational complexity of our algorithms isthe same as one of the LCB-based methods in the literature. Our resultsdemonstrate that distributionally robust optimization method can alsoefficiently solve offline reinforcement learning.