Abstract
Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the mismatch between the point features produced by the encoder and the semantic needs of Large Language Models (LLMs). We identify key aspects that allow 3D LMMs to remove the encoder and let the LLM assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and we present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the early LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL