Deep reinforcement learning (RL) has achieved breakthrough results on many tasks, but has been shown to be sensitive to system changes at test time. As a result, building deep RL agents that generalize has become an active research area. Our aim is to catalyze and streamline community-wide progress on this problem by providing the first benchmark and a common experimental protocol for investigating generalization in RL. Our benchmark contains a diverse set of environments and our evaluation methodology covers both in-distribution and out-of-distribution generalization. To provide a set of baselines for future research, we conduct a systematic evaluation of deep RL algorithms, including those that specifically tackle the problem of generalization.