PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Abstract

While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels), a novel and efficient end-to-end driving architecture that operates using only camera data, without an explicit BEV representation and without the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories directly from raw pixel inputs. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be available at https://maxiuw.github.io/prix.
