Abstract
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.