Aligning Text, Images, and 3D Structure Token-by-Token

Abstract

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D datasets, both synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
