Abstract
Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and at times significant computational resources. While compute resources available per dollar have continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms. Recent studies have highlighted that popular algorithms are sensitive to hyper-parameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). Here we take this one step further. This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyper-parameters and experimenter bias. Throughout we highlight common mistakes found in the literature and the statistical consequences of those mistakes in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design.