Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications

Abstract

Although reinforcement learning has become very popular in recent years, the number of successful applications to operations research problems is rather scarce. Reinforcement learning is based on the well-studied dynamic programming technique and thus also aims at finding the best stationary policy for a given Markov Decision Process, but in contrast to dynamic programming it does not require any model knowledge. The policy is assessed solely on consecutive states (or state-action pairs), which are observed while an agent explores the solution space. The contributions of this paper are manifold. First, we provide deep theoretical insights into the widely applied standard discounted reinforcement learning framework, which explain why these algorithms are inappropriate when permanently provided with non-zero rewards, such as costs or profit. Second, we establish a novel near-Blackwell-optimal reinforcement learning algorithm. In contrast to former methods, it assesses the average reward per step separately and thus prevents the incautious combination of different types of state values. The Laurent series expansion of the discounted state values forms the foundation for this development and also provides the connection between the two approaches. Finally, we demonstrate the viability of our algorithm on a challenging problem set, which includes a well-studied M/M/1 admission control queuing system. In contrast to standard discounted reinforcement learning, our algorithm infers the optimal policy on all tested problems. The overall insight is that machine learning techniques have to be adapted and advanced before they can be applied successfully in the operations research domain.
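To illustrate the Laurent series connection mentioned above, the following minimal sketch checks numerically that the discounted state value of a fixed policy splits into an average reward term scaled by 1/(1 - gamma) plus a bias value, up to an error that vanishes as gamma approaches 1. The two-state chain, its rewards, and all function names are hypothetical choices made for illustration only; they are not taken from the paper, and the closed-form expressions hold only for an ergodic two-state chain.

```python
# Illustrative sketch, not the paper's algorithm. The chain P, the rewards r
# and all names below are assumptions chosen for a small worked example.

def discounted_values(P, r, gamma):
    """Solve (I - gamma * P) v = r for a two-state chain via Cramer's rule."""
    a, b = 1 - gamma * P[0][0], -gamma * P[0][1]
    c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def average_reward_and_bias(P, r):
    """Average reward per step and bias values of an ergodic two-state chain."""
    p01, p10 = P[0][1], P[1][0]
    # Stationary distribution of the two-state chain.
    pi = [p10 / (p01 + p10), p01 / (p01 + p10)]
    rho = pi[0] * r[0] + pi[1] * r[1]
    # The bias h solves h = r - rho + P h, normalised so that pi . h = 0.
    diff = (r[0] - r[1]) / (p01 + p10)  # equals h[0] - h[1]
    return rho, [pi[1] * diff, -pi[0] * diff]


P = [[0.9, 0.1], [0.5, 0.5]]  # transition probabilities under a fixed policy
r = [1.0, 3.0]                # non-zero reward received in every step
gamma = 0.999

v = discounted_values(P, r, gamma)
rho, bias = average_reward_and_bias(P, r)
approx = [rho / (1 - gamma) + h for h in bias]

for s in range(2):
    print(f"state {s}: discounted={v[s]:.4f}  rho/(1-gamma)+bias={approx[s]:.4f}")
```

With gamma close to 1 the term rho / (1 - gamma) dominates both state values, while the quantity that actually distinguishes the states (the bias) is several orders of magnitude smaller. This is one way to see why estimating the average reward separately, as the abstract describes, can be preferable when rewards are permanently non-zero.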
