VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

Abstract

Generating 3D vehicle assets from in-the-wild observations is crucial to autonomous driving. Existing image-to-3D methods do not address this problem well because they learn generation from image RGB information alone, without a deeper understanding of in-the-wild vehicles (such as car models, manufacturers, etc.). As a result, their zero-shot predictions are poor on real-world observations with occlusion or tricky viewing angles. To solve this problem, we propose VQA-Diff, a novel framework that leverages in-the-wild vehicle images to create photorealistic 3D vehicle assets for autonomous driving. VQA-Diff exploits the real-world knowledge inherited from the Large Language Model in the Visual Question Answering (VQA) model for robust zero-shot prediction, and the rich image prior knowledge in the Diffusion model for structure and appearance generation. In particular, we use a multi-expert Diffusion Models strategy to generate the structure information and employ a subject-driven, structure-controlled generation mechanism to model appearance information. Consequently, VQA-Diff retains a robust zero-shot image-to-novel-view generation ability without needing to learn from a large-scale image-to-3D vehicle dataset collected from the real world. We conduct experiments on various datasets, including Pascal 3D+, Waymo, and Objaverse, to demonstrate that VQA-Diff outperforms existing state-of-the-art methods both qualitatively and quantitatively.
