Abstract
Being data-driven is one of the most iconic properties of deep learning algorithms. The birth of ImageNet drove a remarkable trend of "learning from large-scale data" in computer vision. Pretraining on ImageNet to obtain rich universal representations has been shown to benefit various 2D visual tasks and has become a standard in 2D vision. However, because collecting real-world 3D data is laborious, there is not yet a generic dataset serving as a counterpart of ImageNet in 3D vision, and how such a dataset could impact the 3D community remains unexplored. To remedy this, we introduce MVImgNet, a large-scale dataset of multi-view images, which is highly convenient to collect by shooting videos of real-world objects in daily life. It contains 6.5 million frames from 219,188 videos covering objects from 238 classes, with rich annotations of object masks, camera parameters, and point clouds. The multi-view attribute endows our dataset with 3D-aware signals, making it a soft bridge between 2D and 3D vision. We conduct pilot studies to probe the potential of MVImgNet on a variety of 3D and 2D visual tasks, including radiance field reconstruction, multi-view stereo, and view-consistent image understanding, where MVImgNet demonstrates promising performance, leaving many possibilities for future exploration. In addition, via dense reconstruction on MVImgNet, we derive a 3D object point cloud dataset called MVPNet, covering 87,200 samples from 150 categories, with a class label on each point cloud. Experiments show that MVPNet can benefit real-world 3D object classification while posing new challenges to point cloud understanding. MVImgNet and MVPNet will be publicly available, in the hope of inspiring the broader vision community.