Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency

Abstract

We study reinforcement learning for partially observed Markov decisionprocesses (POMDPs) with infinite observation and state spaces, which remainsless investigated theoretically. To this end, we make the first attempt atbridging partial observability and function approximation for a class of POMDPswith a linear structure. In detail, we propose a reinforcement learningalgorithm (Optimistic Exploration via Adversarial Integral Equation orOP-TENET) that attains an $\epsilon$-optimal policy within $O(1/\epsilon^2)$episodes. In particular, the sample complexity scales polynomially in theintrinsic dimension of the linear structure and is independent of the size ofthe observation and state spaces. The sample efficiency of OP-TENET is enabled by a sequence of ingredients:(i) a Bellman operator with finite memory, which represents the value functionin a recursive manner, (ii) the identification and estimation of such anoperator via an adversarial integral equation, which features a smootheddiscriminator tailored to the linear structure, and (iii) the exploration ofthe observation and state spaces via optimism, which is based on quantifyingthe uncertainty in the adversarial integral equation.

Quick Read (beta)

loading the full paper ...