Abstract
Our approach to training 3D vision-language understanding models is to train a feedforward model that makes predictions in 3D but never requires 3D labels: it is supervised only in 2D, using 2D losses and differentiable rendering. The approach is new for vision-language understanding. By treating the reconstruction as a ``latent variable'', we can render the outputs without placing unnecessary constraints on the network architecture (e.g., it can be used with decoder-only models). Training requires only images, camera poses, and 2D labels. We show that we can even remove the need for 2D labels by using pseudo-labels from pretrained 2D models. We use this approach to pretrain a network, which we then finetune for 3D vision-language understanding tasks. We show that this approach outperforms baselines and the prior state of the art on 3D vision-language grounding, and also outperforms other 3D pretraining techniques. Project page: https://liftgs.github.io.