Reinforcement learning (RL) has been shown to reach superhuman performance across a wide range of tasks. However, unlike in supervised machine learning, learning strategies that generalize well to a wide range of situations remains one of the most challenging problems for real-world RL. Autonomous driving (AD) provides a multi-faceted experimental field, as it is necessary to learn the correct behavior over many variations of road layouts and large distributions of possible traffic situations, including individual driver personalities and hard-to-predict traffic events. In this paper, we propose a challenging benchmark for generalizable RL for AD based on a configurable, flexible, and performant code base. Our benchmark uses a catalog of randomized scenario generators, including multiple mechanisms for road layout and traffic variations, different numerical and visual observation types, distinct action spaces, and diverse vehicle models, and it also allows for use under static scenario definitions. In addition to purely algorithmic insights, our application-oriented benchmark also enables a better understanding of the impact of design decisions, such as the choice of action and observation space, on the generalizability of policies. Our benchmark aims to encourage researchers to propose solutions that are able to successfully generalize across scenarios, a task in which current RL methods fail. The code for the benchmark is available at https://github.com/seawee1/driver-dojo.