Abstract
Adversarial self-play in two-player games has delivered impressive resultswhen used with reinforcement learning algorithms that combine deep neuralnetworks and tree search. Algorithms like AlphaZero and Expert Iteration learntabula-rasa, producing highly informative training data on the fly. However,the self-play training strategy is not directly applicable to single-playergames. Recently, several practically important combinatorial optimizationproblems, such as the traveling salesman problem and the bin packing problem,have been reformulated as reinforcement learning problems, increasing theimportance of enabling the benefits of self-play beyond two-player games. Wepresent the Ranked Reward (R2) algorithm which accomplishes this by ranking therewards obtained by a single agent over multiple games to create a relativeperformance metric. Results from applying the R2 algorithm to instances of atwo-dimensional bin packing problem show that it outperforms generic MonteCarlo tree search, heuristic algorithms and reinforcement learning algorithmsnot using ranked rewards.