Abstract
Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
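To make the idea of part-restricted attention concrete, the following is a minimal NumPy sketch, not the Ultra3D implementation. It assumes a single attention head, no learned projections, and one hard integer part label per voxel token. The names `part_attention`, `pair_counts`, and `part_ids` are hypothetical and chosen only for illustration. Each token attends only to tokens that share its part label, so the number of query-key pairs falls from N^2 to the sum of squared part sizes.

```python
import numpy as np


def part_attention(q, k, v, part_ids):
    """Scaled dot-product attention computed independently inside each part.

    q, k, v: float arrays of shape (N, d), one row per voxel token.
    part_ids: int array of shape (N,), an assumed hard part label per token.
    """
    d = q.shape[-1]
    out = np.zeros_like(v)
    for p in np.unique(part_ids):
        idx = np.nonzero(part_ids == p)[0]           # tokens belonging to part p
        scores = q[idx] @ k[idx].T / np.sqrt(d)      # (n_p, n_p) instead of (N, N)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out


def pair_counts(part_ids):
    """Return (global, part-restricted) counts of query-key pairs."""
    n = part_ids.size
    _, sizes = np.unique(part_ids, return_counts=True)
    return n * n, int((sizes ** 2).sum())


# Toy example: five tokens split into two parts of sizes 2 and 3.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 4)) for _ in range(3))
part_ids = np.array([0, 0, 1, 1, 1])
out = part_attention(q, k, v, part_ids)
print(out.shape, pair_counts(part_ids))
```

With equally sized parts, the pair count in this sketch shrinks by roughly the number of parts. The measured speed-up of a real system depends on the part-size distribution and on kernel-level details that this sketch does not model.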