Abstract
Realistic and controllable traffic simulation is a core capability that is necessary to accelerate autonomous vehicle (AV) development. However, current approaches for controlling learning-based traffic models require significant domain expertise and are difficult for practitioners to use. To remedy this, we present CTG++, a scene-level conditional diffusion model that can be guided by language instructions. Developing this requires tackling two challenges: the need for a realistic and controllable traffic model backbone, and an effective method to interface with a traffic model using language. To address these challenges, we first propose a scene-level diffusion model equipped with a spatio-temporal transformer backbone, which generates realistic and controllable traffic. We then harness a large language model (LLM) to convert a user's query into a loss function, guiding the diffusion model towards query-compliant generation. Through comprehensive evaluation, we demonstrate the effectiveness of our proposed method in generating realistic, query-compliant traffic simulations.