Abstract
Learning robust driving policies from large-scale, real-world datasets is a central challenge in autonomous driving, as online data collection is often unsafe and impractical. While Behavioral Cloning (BC) offers a straightforward approach to imitation learning, policies trained with BC are notoriously brittle and suffer from compounding errors in closed-loop execution. This work presents a comprehensive pipeline and a comparative study to address this limitation. We first develop a series of increasingly sophisticated BC baselines, culminating in a Transformer-based model that operates on a structured, entity-centric state representation. While this model achieves low imitation loss, we show that it still fails in long-horizon simulations. We then demonstrate that by applying a state-of-the-art Offline Reinforcement Learning algorithm, Conservative Q-Learning (CQL), to the same data and architecture, we can learn a significantly more robust policy. Using a carefully engineered reward function, the CQL agent learns a conservative value function that enables it to recover from minor errors and avoid out-of-distribution states. In a large-scale evaluation on 1,000 unseen scenarios from the Waymo Open Motion Dataset, our final CQL agent achieves a 3.2x higher success rate and a 7.4x lower collision rate than the strongest BC baseline, proving that an offline RL approach is critical for learning robust, long-horizon driving policies from static expert data.
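
To make the notion of a "conservative value function" concrete, the following is a minimal sketch of the standard CQL regularizer in its discrete-action form: a log-sum-exp over all Q-values minus the Q-value of the action found in the dataset, added to the usual squared TD error. The abstract does not specify the action space, reward terms, or loss weights used in this work, so the discrete action set, the function names, the `alpha` weight, and the precomputed `td_target` are all illustrative assumptions and not the implementation evaluated here.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values, data_action):
    """Conservative penalty for one state (assumes discrete actions).

    q_values: Q(s, a) for every action a in a hypothetical discretized set.
    data_action: index of the action taken by the expert in the dataset.
    The penalty is small when the dataset action already dominates and
    large when actions outside the data receive high Q-values.
    """
    return logsumexp(q_values) - q_values[data_action]


def cql_loss(batch, alpha=1.0):
    """Batch mean of 0.5 * squared TD error + alpha * conservative penalty.

    Each item is (q_values, data_action, td_target). td_target is assumed
    to be precomputed elsewhere, e.g. r + gamma * max_a' Q_target(s', a').
    """
    total = 0.0
    for q_values, data_action, td_target in batch:
        td_error = q_values[data_action] - td_target
        total += 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, data_action)
    return total / len(batch)
```

Because the penalty pushes down Q-values for actions the expert never took, a greedy policy over the learned Q-function is discouraged from drifting into states that the logged data does not cover, which is the mechanism the abstract credits for improved closed-loop robustness.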