Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL), where the agent aims to learn the optimal policy based on data collected by a behavior policy, has attracted increasing attention in recent years. While offline RL with linear function approximation has been extensively studied, with optimal results achieved under certain assumptions, many works have shifted their attention to offline RL with non-linear function approximation. However, few works on offline RL with non-linear function approximation have instance-dependent regret guarantees. In this paper, we propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI), for offline RL with non-linear function approximation. Our algorithmic design comprises three innovative components: (1) a variance-based weighted regression scheme that can be applied to a wide range of function classes, (2) a subroutine for variance estimation, and (3) a planning phase that utilizes a pessimistic value iteration approach. Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation. Our work extends previous instance-dependent results for simpler function classes, such as linear and differentiable function classes, to a more general framework.
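To make two of the ingredients named above concrete, here is a minimal sketch of variance-weighted regression followed by a pessimistic value estimate. It is restricted to the linear special case and is not the PNLSVI algorithm itself: the function names `weighted_ridge` and `pessimistic_value`, the ridge parameter `lam`, the penalty scale `beta`, and the synthetic data are all assumptions made for illustration, and the per-sample variances are taken as given rather than estimated by a subroutine.

```python
import numpy as np

def weighted_ridge(phi, targets, variances, lam=1.0):
    """Ridge regression where each sample is weighted by 1 / variance.

    Hypothetical helper: low-variance samples influence the fit more.
    """
    w = 1.0 / variances
    weighted_phi = phi * w[:, None]
    # Weighted covariance matrix with ridge regularization.
    cov = lam * np.eye(phi.shape[1]) + weighted_phi.T @ phi
    theta = np.linalg.solve(cov, weighted_phi.T @ targets)
    return theta, cov

def pessimistic_value(phi_query, theta, cov, beta=1.0):
    """Point estimate minus an uncertainty penalty (assumed elliptical form)."""
    cov_inv = np.linalg.inv(cov)
    # bonus_i = sqrt(phi_i^T cov^{-1} phi_i), larger where data coverage is poor.
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", phi_query, cov_inv, phi_query))
    return phi_query @ theta - beta * bonus

# Synthetic offline dataset (assumed, for illustration only).
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
variances = rng.uniform(0.5, 2.0, size=200)
targets = phi @ theta_true + 0.1 * np.sqrt(variances) * rng.normal(size=200)

theta_hat, cov = weighted_ridge(phi, targets, variances)
values = pessimistic_value(phi[:5], theta_hat, cov)
```

The penalty lowers the estimate most for feature directions that the offline data covers poorly, which is the intuition behind planning pessimistically from a fixed dataset.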
