Abstract
End-to-end autonomous driving with vision only is not only more cost-effective than LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework that generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ Module, encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning with an autoregressive world model.