Deep Reinforcement Learning at the Edge of the Statistical Precipice

Abstract

Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in results reported with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results. We also present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied by an open-source library, rliable, to prevent unreliable results from stagnating the field.
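To make the two central ideas of the abstract concrete, the following is a minimal NumPy sketch of an interquartile mean (IQM) and a percentile bootstrap interval around it. It is an illustration under stated assumptions, not the rliable API: the function names interquartile_mean and stratified_bootstrap_ci are hypothetical, scores are assumed to be a (num_runs, num_tasks) array of normalized scores, and the resampling scheme (runs drawn with replacement independently for each task) is one reasonable reading of "interval estimates of aggregate performance".

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of all scores, pooled over runs and tasks.

    Hypothetical helper for illustration; drops the lowest and highest
    25% of the pooled values and averages what remains.
    """
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values trimmed from each end
    return float(flat[cut:flat.size - cut].mean())


def stratified_bootstrap_ci(scores, statistic, num_resamples=2000,
                            alpha=0.05, seed=0):
    """Percentile bootstrap interval for an aggregate statistic.

    Assumes `scores` has shape (num_runs, num_tasks). Runs are resampled
    with replacement separately within each task (the strata), so every
    resample keeps the same number of runs per task.
    """
    scores = np.asarray(scores, dtype=float)
    num_runs, num_tasks = scores.shape
    rng = np.random.default_rng(seed)
    task_idx = np.arange(num_tasks)
    estimates = np.empty(num_resamples)
    for b in range(num_resamples):
        # One independent draw of run indices per task column.
        run_idx = rng.integers(0, num_runs, size=(num_runs, num_tasks))
        estimates[b] = statistic(scores[run_idx, task_idx])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic scores for 5 runs on 26 tasks; purely illustrative data.
    demo_scores = np.random.default_rng(1).normal(0.5, 0.2, size=(5, 26))
    point = interquartile_mean(demo_scores)
    low, high = stratified_bootstrap_ci(demo_scores, interquartile_mean)
    print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the point estimate shows how much of an apparent gap between two algorithms could be explained by the small number of runs alone.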
