Unlike the popular modularized framework, end-to-end autonomous driving seeks to solve the perception, decision, and control problems in an integrated way, which can adapt better to new scenarios and generalize more easily at scale. However, existing end-to-end approaches often lack interpretability and can only handle simple driving tasks such as lane keeping. In this paper, we propose an interpretable deep reinforcement learning method for end-to-end autonomous driving that is able to handle complex urban scenarios. A sequential latent environment model is introduced and learned jointly with the reinforcement learning process. With this latent model, a semantic bird's-eye mask can be generated, which is enforced to connect with a certain intermediate property in today's modularized framework for the purpose of explaining the behavior of the learned policy. The latent space also significantly reduces the sample complexity of reinforcement learning. Comparison tests with a simulated autonomous car in CARLA show that, in urban scenarios crowded with surrounding vehicles, our method outperforms many baselines, including DQN, DDPG, TD3, and SAC. Moreover, through the masked outputs, the learned policy is able to provide a better explanation of how the car reasons about the driving environment. The code and videos of this work are available at our GitHub repo and project website.
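To make the data flow described above concrete, the following is a minimal sketch, not the method's actual implementation. It assumes random linear maps in place of the learned networks, and every name and size in it (`encode`, `transition`, `decode_mask`, `policy`, `OBS_DIM`, `MASK_SHAPE`, and so on) is hypothetical. It only illustrates the idea that the policy acts on a latent state while a decoder turns that same latent state into a bird's-eye mask that can be inspected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, LATENT_DIM, ACTION_DIM = 64, 8, 2
MASK_SHAPE = (16, 16)  # hypothetical bird's-eye grid

# Randomly initialised linear maps stand in for learned networks (assumption).
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, OBS_DIM))
W_trans = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))
W_dec = rng.normal(scale=0.1, size=(MASK_SHAPE[0] * MASK_SHAPE[1], LATENT_DIM))
W_pi = rng.normal(scale=0.1, size=(ACTION_DIM, LATENT_DIM))


def encode(obs):
    """Map a flattened sensor observation to a latent state."""
    return np.tanh(W_enc @ obs)


def transition(z, action):
    """Predict the next latent state from the current latent state and action."""
    return np.tanh(W_trans @ np.concatenate([z, action]))


def decode_mask(z):
    """Decode a latent state into per-cell mask probabilities."""
    logits = W_dec @ z
    return (1.0 / (1.0 + np.exp(-logits))).reshape(MASK_SHAPE)


def policy(z):
    """Choose a bounded action (e.g. steering, throttle) from the latent state."""
    return np.tanh(W_pi @ z)


def rollout(obs, steps=3):
    """Act in latent space and record the mask decoded at each step."""
    z = encode(obs)
    trace = []
    for _ in range(steps):
        action = policy(z)
        mask = decode_mask(z)
        trace.append((action, mask))
        z = transition(z, action)
    return trace
```

In this sketch, calling `rollout(np.ones(OBS_DIM))` returns a list of `(action, mask)` pairs, so each action can be examined next to the mask decoded from the same latent state. In a trained system the maps would be learned jointly with the reinforcement learning objective rather than sampled at random.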