Abstract
End-to-end autonomous driving has garnered widespread attention. Current end-to-end approaches largely rely on supervision from perception tasks such as detection, tracking, and map segmentation to aid in learning scene representations. However, these methods require extensive annotations, which hinders data scalability. To address this challenge, we propose a novel self-supervised method that enhances end-to-end driving without the need for costly labels. Specifically, our framework \textbf{LAW} uses a LAtent World model to predict future latent features from the predicted ego actions and the latent feature of the current frame. The predicted latent features are supervised by the features actually observed in the future. This supervision jointly optimizes latent feature learning and action prediction, which greatly enhances driving performance. As a result, our approach achieves state-of-the-art performance on both open-loop and closed-loop benchmarks without costly annotations.
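
The prediction-and-supervision loop described in the abstract can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the LAW implementation: the single linear map standing in for the world model, the mean-squared-error loss, and every name (`predict_future_latent`, `latent_prediction_loss`, `world_weights`) are hypothetical and chosen here only for illustration.

```python
def linear(weights, vector):
    """Multiply a row-major weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def predict_future_latent(world_weights, latent_t, action_t):
    """Hypothetical latent world model.

    Assumption: one linear map applied to the concatenation of the
    current latent feature and the predicted ego action. A real model
    would be a learned network; this only shows the data flow.
    """
    return linear(world_weights, latent_t + action_t)


def latent_prediction_loss(predicted, observed):
    """Assumed loss: mean squared error between the predicted future
    latent and the latent actually extracted from the future frame."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Toy example: a 2-D latent and a 1-D action.
world_weights = [
    [1.0, 0.0, 0.5],  # future latent[0] depends on latent[0] and the action
    [0.0, 1.0, 0.0],  # future latent[1] copies latent[1]
]
latent_t = [1.0, 2.0]         # latent feature of the current frame
action_t = [2.0]              # predicted ego action
latent_t_plus_1 = [2.0, 4.0]  # latent observed in the future frame

predicted = predict_future_latent(world_weights, latent_t, action_t)
loss = latent_prediction_loss(predicted, latent_t_plus_1)
print(predicted, loss)
```

Because the loss depends on both the current latent and the predicted action, minimizing it with respect to the parameters that produce them would push gradients into feature learning and action prediction together, which is the joint optimization the abstract refers to. No labels beyond the future observation itself are needed.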