While Reinforcement Learning has made great strides towards solving ever more complicated tasks, many algorithms are still brittle to even slight changes in their environment. This is a limiting factor for real-world applications of RL. Although the research community continuously aims at improving both robustness and generalization of RL algorithms, unfortunately it still lacks an open-source set of well-defined benchmark problems based on a consistent theoretical framework, which allows comparing different approaches in a fair, reliable and reproducible way. To fill this gap, we propose CARL, a collection of well-known RL environments extended to contextual RL problems to study generalization. We show the urgent need for such benchmarks by demonstrating that even simple toy environments become challenging for commonly used approaches if different contextual instances of the same task have to be considered. Furthermore, CARL allows us to provide first evidence that disentangling representation learning of the states from the policy learning with the context facilitates better generalization. By providing variations of diverse benchmarks from classic control, physical simulations, games and a real-world application of RNA design, CARL will allow the community to derive many more such insights on a solid empirical foundation.