On the Sample Complexity of Vanilla Model-Based Offline Reinforcement Learning with Dependent Samples

Abstract

Offline reinforcement learning (offline RL) considers problems where learningis performed using only previously collected samples and is helpful for thesettings in which collecting new data is costly or risky. In model-basedoffline RL, the learner performs estimation (or optimization) using a modelconstructed according to the empirical transition frequencies. We analyze thesample complexity of vanilla model-based offline RL with dependent samples inthe infinite-horizon discounted-reward setting. In our setting, the samplesobey the dynamics of the Markov decision process and, consequently, may haveinterdependencies. Under no assumption of independent samples, we provide ahigh-probability, polynomial sample complexity bound for vanilla model-basedoff-policy evaluation that requires partial or uniform coverage. We extend thisresult to the off-policy optimization under uniform coverage. As a comparisonto the model-based approach, we analyze the sample complexity of off-policyevaluation with vanilla importance sampling in the infinite-horizon setting.Finally, we provide an estimator that outperforms the sample-mean estimator foralmost deterministic dynamics that are prevalent in reinforcement learning.

Quick Read (beta)

loading the full paper ...