Detect Anything 3D in the Wild

Abstract

Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which mitigates catastrophic forgetting in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of 3D foundation models for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at the DetAny3D project page.
