Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization

Abstract

Adversarial self-play in two-player games has delivered impressive resultswhen used with reinforcement learning algorithms that combine deep neuralnetworks and tree search. Algorithms like AlphaZero and Expert Iteration learntabula-rasa, producing highly informative training data on the fly. However,the self-play training strategy is not directly applicable to single-playergames. Recently, several practically important combinatorial optimisationproblems, such as the travelling salesman problem and the bin packing problem,have been reformulated as reinforcement learning problems, increasing theimportance of enabling the benefits of self-play beyond two-player games. Wepresent the Ranked Reward (R2) algorithm which accomplishes this by ranking therewards obtained by a single agent over multiple games to create a relativeperformance metric. Results from applying the R2 algorithm to instances of atwo-dimensional and three-dimensional bin packing problems show that itoutperforms generic Monte Carlo tree search, heuristic algorithms and integerprogramming solvers. We also present an analysis of the ranked rewardmechanism, in particular, the effects of problem instances with varyingdifficulty and different ranking thresholds.

Quick Read (beta)

loading the full paper ...