Abstract
We consider policy optimization methods in reinforcement learning settingswhere the state space is arbitrarily large, or even countably infinite. Themotivation arises from control problems in communication networks, matchingmarkets, and other queueing systems. We consider Natural Policy Gradient (NPG),which is a popular algorithm for finite state spaces. Under reasonableassumptions, we derive a performance bound for NPG that is independent of thesize of the state space, provided the error in policy evaluation is within afactor of the true value function. We obtain this result by establishing newpolicy-independent bounds on the solution to Poisson's equation, i.e., therelative value function, and by combining these bounds with previously knownconnections between MDPs and learning from experts.