Posterior Sampling for Large Scale Reinforcement Learning

Abstract

Posterior sampling for reinforcement learning (PSRL) is a popular algorithm for learning to control an unknown Markov decision process (MDP). PSRL maintains a distribution over MDP parameters and, in an episodic fashion, samples MDP parameters, computes the optimal policy for them, and executes it. A special case of PSRL is where, at the end of each episode, the MDP resets to the initial state distribution. Extensions of this idea to general MDPs without state resetting have so far produced impractical algorithms and, in some cases, buggy theoretical analysis. This is due to the difficulty of analyzing regret under episode switching schedules that depend on random variables of the true underlying model. We propose a solution to this problem that uses a deterministic, model-independent episode switching schedule, and establish a Bayes regret bound under mild assumptions. Our algorithm, termed deterministic schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity. Our result is more generally applicable to continuous state-action problems. We demonstrate how this algorithm is well suited for sequential recommendation problems such as points of interest (POI). We derive a general procedure for parameterizing the underlying MDPs, to create action-conditioned dynamics from passive data that do not contain actions. We prove that such parameterization satisfies the assumptions of our analysis.
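To make the sample, plan, and execute loop concrete, the following is a minimal tabular sketch of posterior sampling with a deterministic switching schedule and no state resets. It is an illustration under stated assumptions, not the paper's algorithm as specified: the abstract does not give the schedule, so the doubling episode lengths are assumed here, as are the Dirichlet prior over transitions, the known reward table, and the use of discounted value iteration as the planner. All function names (`value_iteration`, `switch_times`, `ds_psrl_sketch`) are hypothetical.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """Greedy policy for a tabular MDP. P has shape (S, A, S), R has shape (S, A).
    Discounted planning is a stand-in chosen for this sketch."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # (S, A) action values under the current V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def switch_times(horizon):
    """Deterministic, model-independent schedule: resample at t = 0, 1, 3, 7, ...
    The doubling rule is an assumption of this sketch."""
    times, t, length = [], 0, 1
    while t < horizon:
        times.append(t)
        t += length
        length *= 2
    return times

def ds_psrl_sketch(P_true, R, horizon, seed=0):
    """Run posterior sampling on one continuing trajectory (no state resets)."""
    rng = np.random.default_rng(seed)
    S, A, _ = P_true.shape
    counts = np.ones((S, A, S))  # Dirichlet(1, ..., 1) prior over each transition row (assumed)
    switches = set(switch_times(horizon))
    state, total_reward = 0, 0.0
    policy = np.zeros(S, dtype=int)
    for t in range(horizon):
        if t in switches:
            # Sample one MDP from the posterior and plan for it.
            P_sample = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                                 for s in range(S)])
            policy = value_iteration(P_sample, R)
        action = policy[state]
        next_state = rng.choice(S, p=P_true[state, action])
        counts[state, action, next_state] += 1  # conjugate posterior update
        total_reward += R[state, action]
        state = next_state
    return total_reward, counts

# Toy two-state, two-action MDP (made up for illustration).
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
total_reward, counts = ds_psrl_sketch(P_true, R, horizon=500)
```

Because the switch times depend only on the time index, they can be computed before any data are observed, which is the property the abstract contrasts with schedules that depend on random variables of the true model.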
