Nearly Minimax Optimal Reward-free Reinforcement Learning

Abstract

We study the reward-free reinforcement learning framework, which isparticularly suitable for batch reinforcement learning and scenarios where oneneeds policies for multiple reward functions. This framework has two phases. Inthe exploration phase, the agent collects trajectories by interacting with theenvironment without using any reward signal. In the planning phase, the agentneeds to return a near-optimal policy for arbitrary reward functions. We give anew efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated\textbf{P}lanning (\algoname), which interacts with the environment at most$O\left(\frac{S^2A}{\epsilon^2}\text{poly}\log\left(\frac{SAH}{\epsilon}\right)\right)$ episodes in the exploration phase, and guarantees to output anear-optimal policy for arbitrary reward functions in the planning phase. Here,$S$ is the size of state space, $A$ is the size of action space, $H$ is theplanning horizon, and $\epsilon$ is the target accuracy relative to the totalreward. Notably, our sample complexity scales only \emph{logarithmically} with$H$, in contrast to all existing results which scale \emph{polynomially} with$H$. Furthermore, this bound matches the minimax lower bound$\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors. Our results rely on three new techniques : 1) A new sufficient condition forthe dataset to plan for an $\epsilon$-suboptimal policy; 2) A new way to planefficiently under the proposed condition using soft-truncated planning; 3)Constructing extended MDP to maximize the truncated accumulative rewardsefficiently.

Quick Read (beta)

loading the full paper ...