Abstract
Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement is transferable to a variety of indoor datasets and out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.