Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

Abstract

Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.
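
To make the octree idea concrete, here is a minimal sketch of how a sparse point set could be turned into coarse-to-fine occupancy tokens. It is not the Uni-3DAR implementation: the function name `octree_tokens`, the unit-cube normalization, the 8-bit child-occupancy mask, and the octant bit order are all assumptions made for illustration, and the sketch covers neither the fine-grained atom-type and coordinate tokens nor the two-level subtree compression.

```python
import numpy as np

def octree_tokens(points, depth):
    """Hypothetical helper: per-level (cell, child_mask) tokens for points in [0, 1)^3.

    Only occupied cells emit a token, which is where the sparsity saving comes from.
    """
    points = np.asarray(points, dtype=float)
    # Integer grid coordinates at the finest level (2**depth cells per axis).
    grid = np.floor(points * (2 ** depth)).astype(int)
    grid = np.clip(grid, 0, 2 ** depth - 1)

    levels = []
    for level in range(depth):
        parent = grid >> (depth - level)            # cell containing each point at this level
        child = (grid >> (depth - level - 1)) & 1   # per-axis bit choosing one of 8 octants
        octant = child[:, 0] * 4 + child[:, 1] * 2 + child[:, 2]

        masks = {}
        for cell, o in zip(parent, octant):
            key = tuple(int(c) for c in cell)
            # Set the bit of every occupied child octant (assumed encoding).
            masks[key] = masks.get(key, 0) | (1 << int(o))
        levels.append(sorted(masks.items()))
    return levels

# Two points in opposite corners of the unit cube, octree of depth 2.
tokens = octree_tokens([[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]], depth=2)
num_tokens = sum(len(level) for level in tokens)
```

In this toy case the finest grid has 4 x 4 x 4 = 64 cells, yet only three tokens are emitted (one at the root level, two at the next), because empty cells are never expanded.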
