In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.