Abstract
Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.