Abstract
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, an innovative point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. As a powerful 3D MLLM, GPT4Point can seamlessly execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with advanced capabilities for controllable 3D generation: it can produce high-quality results from low-quality point-text features while maintaining geometric shapes and colors. To meet the large demand for 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database of over 1M objects with varied levels of text granularity from the Objaverse-XL dataset, which is essential for training GPT4Point. We also propose a comprehensive benchmark to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point has demonstrated superior performance in understanding and generation.