Pessimistic Model Selection for Offline Deep Reinforcement Learning

Abstract

Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is over-fitting, which leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data, model selection is a challenging task because no ground truth is available for evaluating performance, in contrast with the online setting with simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a provably effective framework for finding the best policy among a set of candidate models. Two refined approaches are also proposed to address the potential bias of DRL models in identifying the optimal policy. Numerical studies demonstrate the superior performance of our approach over existing methods.
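To make the idea of pessimism in model selection concrete, the following minimal sketch ranks candidate policies by a lower confidence bound on their estimated value rather than by the point estimate. It is an illustration under stated assumptions, not the PMS procedure itself: the abstract does not specify the estimator, so the normal-approximation bound, the function name `pessimistic_select`, the level `alpha`, and the toy numbers are all hypothetical.

```python
from statistics import NormalDist


def pessimistic_select(value_estimates, std_errors, alpha=0.05):
    """Pick the candidate with the largest lower confidence bound.

    Assumption: each candidate policy comes with an off-policy value
    estimate and a standard error, and a one-sided normal approximation
    is adequate. None of this is taken from the paper.
    """
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    lower_bounds = [v - z * s for v, s in zip(value_estimates, std_errors)]
    best = max(range(len(lower_bounds)), key=lower_bounds.__getitem__)
    return best, lower_bounds


# Hypothetical toy inputs: candidate 1 has the highest point estimate
# but also the largest uncertainty.
values = [1.00, 1.20, 1.10]
errors = [0.05, 0.40, 0.10]
best, bounds = pessimistic_select(values, errors)
```

In this toy setting, selecting on the point estimate alone would favor candidate 1, whereas the lower-bound criterion penalizes its large uncertainty and prefers candidate 2.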
