Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms

Abstract

Partial Observability -- where agents can only observe partial informationabout the true underlying state of the system -- is ubiquitous in real-worldapplications of Reinforcement Learning (RL). Theoretically, learning anear-optimal policy under partial observability is known to be hard in theworst case due to an exponential sample complexity lower bound. Recent work hasidentified several tractable subclasses that are learnable with polynomialsamples, such as Partially Observable Markov Decision Processes (POMDPs) withcertain revealing or decodability conditions. However, this line of research isstill in its infancy, where (1) unified structural conditions enablingsample-efficient learning are lacking; (2) existing sample complexities forknown tractable subclasses are far from sharp; and (3) fewer sample-efficientalgorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL inthe general setting of Predictive State Representations (PSRs). First, wepropose a natural and unified structural condition for PSRs called\emph{B-stability}. B-stable PSRs encompasses the vast majority of knowntractable subclasses such as weakly revealing POMDPs, low-rankfuture-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we showthat any B-stable PSR can be learned with polynomial samples in relevantproblem parameters. When instantiated in the aforementioned subclasses, oursample complexities improve substantially over the current best ones. Finally,our results are achieved by three algorithms simultaneously: Optimistic MaximumLikelihood Estimation, Estimation-to-Decisions, and Model-Based OptimisticPosterior Sampling. The latter two algorithms are new for sample-efficientlearning of POMDPs/PSRs.

Quick Read (beta)

loading the full paper ...