Reinforcement Learning in Rich-Observation MDPs using Spectral Methods

Abstract

Reinforcement learning (RL) in Markov decision processes (MDPs) with large state spaces is a challenging problem. The performance of standard RL algorithms degrades drastically with the dimensionality of state space. However, in practice, these large MDPs typically incorporate a latent or hidden low-dimensional structure. In this paper, we study the setting of rich-observation Markov decision processes (ROMDP), where there are a small number of hidden states which possess an injective mapping to the observation states. In other words, every observation state is generated through a single hidden state, and this mapping is unknown a priori. We introduce a spectral decomposition method that consistently learns this mapping, and more importantly, achieves it with low regret. The estimated mapping is integrated into an optimistic RL algorithm (UCRL), which operates on the estimated hidden space. We derive finite-time regret bounds for our algorithm with a weak dependence on the dimensionality of the observed space. In fact, our algorithm asymptotically achieves the same average regret as the oracle UCRL algorithm, which has the knowledge of the mapping from hidden to observed spaces. Thus, we derive an efficient spectral RL algorithm for ROMDPs.
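
The structural assumption behind ROMDPs, that every observation is generated by a single hidden state, can be illustrated with a toy emission matrix. The sketch below is not the spectral method of the paper: it assumes the emission matrix is already known (the oracle setting) and only shows why the observation-to-hidden-state map is then well defined. The function names `make_romdp_emission` and `decode_observations`, and the equal-sized observation blocks, are hypothetical choices made for illustration.

```python
import numpy as np


def make_romdp_emission(num_hidden, obs_per_state, rng):
    """Build a toy emission matrix of shape (num_obs, num_hidden).

    Assumption for illustration: each hidden state owns a disjoint block of
    observations, so every row has nonzero mass in exactly one column.
    """
    num_obs = num_hidden * obs_per_state
    emission = np.zeros((num_obs, num_hidden))
    for x in range(num_hidden):
        rows = slice(x * obs_per_state, (x + 1) * obs_per_state)
        # Column x is a probability distribution over its own block only.
        emission[rows, x] = rng.dirichlet(np.ones(obs_per_state))
    return emission


def decode_observations(emission):
    """Map each observation to the unique hidden state that can emit it."""
    return emission.argmax(axis=1)


rng = np.random.default_rng(0)
emission = make_romdp_emission(num_hidden=3, obs_per_state=4, rng=rng)
decoder = decode_observations(emission)
```

Here `decoder[y]` gives the hidden state of observation `y`. An oracle learner with this map could run UCRL directly on the 3 hidden states rather than on the 12 observations; in the setting of the paper the map is unknown and has to be estimated from data.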
