Model-Based Exploration in Monitored Markov Decision Processes

Abstract

A tenet of reinforcement learning is that the agent always observes rewards. However, this is not true in many realistic settings, e.g., a human observer may not always be available to provide rewards, sensors may be limited or malfunctioning, or rewards may be inaccessible during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed to model such settings. However, existing Mon-MDP algorithms have several limitations: they do not fully exploit the problem structure, cannot leverage a known monitor, lack worst-case guarantees for 'unsolvable' Mon-MDPs without specific initialization, and offer only asymptotic convergence proofs. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that observable rewards are reliably captured, and another to learn the minimax-optimal policy. Second, we empirically demonstrate the advantages. We show faster convergence than prior algorithms in over four dozen benchmarks, and even more dramatic improvement when the monitoring process is known. Third, we present the first finite-sample bound on performance. We show convergence to a minimax-optimal policy even when some rewards are never observable.
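
For readers unfamiliar with model-based interval estimation, the following is a minimal sketch of its generic exploration-bonus form (often called MBIE-EB) on an ordinary tabular MDP. It is not the algorithm of this paper: it uses a single instance, has no monitor, and assumes every reward is observed. The function name, the parameters `beta` and `r_max`, and the optimistic default for unvisited pairs are illustrative assumptions.

```python
import math

def mbie_eb_q_values(counts, reward_sums, gamma=0.9, beta=1.0, r_max=1.0, iters=200):
    """Optimistic Q-values from empirical counts (hypothetical helper).

    counts[s][a][s2]  : number of observed transitions s --a--> s2
    reward_sums[s][a] : sum of rewards observed after taking a in s
    """
    n_states = len(counts)
    n_actions = len(counts[0])
    q_max = r_max / (1.0 - gamma)  # largest possible discounted return
    q = [[q_max] * n_actions for _ in range(n_states)]

    for _ in range(iters):
        v = [max(row) for row in q]
        # Unvisited pairs keep the optimistic value q_max (an assumption of this sketch).
        new_q = [[q_max] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                n = sum(counts[s][a])
                if n == 0:
                    continue
                r_hat = reward_sums[s][a] / n
                # Expected next-state value under the empirical transition model.
                expected_v = sum(c / n * v[s2] for s2, c in enumerate(counts[s][a]))
                # The bonus beta / sqrt(n) shrinks as the pair is visited more often.
                backup = r_hat + beta / math.sqrt(n) + gamma * expected_v
                new_q[s][a] = min(backup, q_max)
        q = new_q
    return q
```

Acting greedily with respect to these values steers the agent toward state-action pairs it has visited rarely, since their bonus is large or their value is still at the optimistic ceiling.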
