Kaolin: A PyTorch Library for Accelerating 3D Deep Learning Research
We present Kaolin (named after kaolinite, a form of plasticine (clay) that is sometimes used in 3D modeling), a PyTorch library aiming to accelerate 3D deep learning research. Kaolin provides efficient implementations of differentiable 3D modules for use in deep learning systems. With functionality to load and preprocess several popular 3D datasets, and native functions to manipulate meshes, pointclouds, signed distance functions, and voxel grids, Kaolin mitigates the need to write wasteful boilerplate code. Kaolin packages together several differentiable graphics modules including rendering, lighting, shading, and view warping. Kaolin also supports an array of loss functions and evaluation metrics for seamless evaluation and provides visualization functionality to render the 3D results. Importantly, we curate a comprehensive model zoo comprising many state-of-the-art 3D deep learning architectures, to serve as a starting point for future research endeavours. Kaolin is available as open-source software at https://github.com/NVIDIAGameWorks/kaolin/.
3D deep learning is receiving attention and recognition at an accelerated rate due to its high relevance in complex application areas such as robotics [26, 42, 34, 43], self-driving cars [31, 25, 6], and augmented and virtual reality [10, 1]. The advent of deep learning and ever-growing compute infrastructure have allowed for the analysis of highly complicated and previously intractable 3D data [16, 12, 29]. Furthermore, 3D vision research has started an interesting trend of exploiting well-known concepts from related areas such as robotics and computer graphics [17, 20, 23]. Despite this accelerating interest, conducting research within the field involves a steep learning curve due to the lack of standardized tools. No system yet exists that allows a researcher to easily load popular 3D datasets, convert 3D data across various representations and levels of complexity, plug into modern machine learning frameworks, and train and evaluate deep learning architectures. New researchers in the field of 3D deep learning must inevitably compile a collection of mismatched code snippets from various code bases to perform even basic tasks, which has resulted in an uncomfortable absence of comparisons across different state-of-the-art methods.
With the aim of removing the barriers to entry into 3D deep learning and expediting research, we present Kaolin, a 3D deep learning library for PyTorch. Kaolin provides efficient implementations of all core modules required to quickly build 3D deep learning applications. From loading and pre-processing data, to converting it across popular 3D representations (meshes, voxels, signed distance functions, pointclouds, etc.), to performing deep learning tasks on these representations, to computing task-specific metrics and visualizations of 3D data, Kaolin makes the entire life-cycle of a 3D deep learning application intuitive and approachable. In addition, Kaolin implements a large set of popular methods for 3D tasks along with their pre-trained models in our model zoo, to demonstrate the ease with which new methods can now be implemented, and to highlight it as a home for future 3D DL research. Finally, with the advent of differentiable renderers for explicit modeling of geometric structure and other physical processes (lighting, shading, projection, etc.) in 3D deep learning applications [17, 21, 5], Kaolin features a generic, modular differentiable renderer which easily extends to all popular differentiable rendering methods, and is also simple to build upon for future research and development.
2 Kaolin - Overview
Kaolin aims to provide efficient and easy-to-use tools for constructing 3D deep learning architectures and manipulating 3D data. Because Kaolin supplies the boilerplate code that such projects otherwise require, 3D deep learning researchers and practitioners can direct their efforts exclusively to developing the novel aspects of their applications. In the following sections, we briefly describe each major functionality of this 3D deep learning package. For an illustrated overview, see the overview (splash) figure.
2.1 3D Representations
The choice of representation in a 3D deep learning project can have a large impact on its success due to the varied properties that different 3D data types possess. To ensure high flexibility in this choice of representation, Kaolin exhaustively supports all popular 3D representations:
Polygon meshes
Pointclouds
Voxel grids
Signed distance functions and level sets
Depth images (2.5D)
Each representation type is stored as a collection of PyTorch Tensors, within an independent class. This allows for operator overloading over common functions for data augmentation and modifications supported by the package. Efficient (and wherever possible, differentiable) conversions across representations are provided within each class. For example, we provide differentiable surface sampling mechanisms that enable conversion from polygon meshes to pointclouds, by application of the reparameterization trick. Network architectures are also supported for each representation, such as graph convolutional networks and MeshCNN for meshes [18, 14], 3D convolutions for voxels, and PointNet and PointNet++ for pointclouds [29, 39]. With Kaolin, a mesh model can be loaded, differentiably converted into a point cloud, and then rendered in both representations in only a few lines of code.
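The surface-sampling idea can be illustrated independently of Kaolin. What follows is a minimal sketch in plain Python, not Kaolin's API: the function names are hypothetical, and the standard area-weighted barycentric sampling scheme is assumed. A triangle is picked with probability proportional to its area, and a point inside it is formed from two uniform random numbers. Since the random numbers are drawn independently of the vertex positions, each sampled point is a fixed weighted sum of the triangle's vertices, which is the property that lets an autograd framework pass gradients from the points back to the mesh.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c), computed from the cross product."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface(vertices, faces, num_points, rng):
    """Sample points uniformly over the surface of a triangle mesh."""
    areas = [triangle_area(*(vertices[i] for i in face)) for face in faces]
    # Larger triangles are chosen proportionally more often.
    chosen = rng.choices(faces, weights=areas, k=num_points)
    points = []
    for face in chosen:
        a, b, c = (vertices[i] for i in face)
        # The noise does not depend on the vertices (reparameterization).
        r1, r2 = math.sqrt(rng.random()), rng.random()
        w = (1.0 - r1, r1 * (1.0 - r2), r1 * r2)  # barycentric weights
        points.append(
            tuple(w[0] * a[k] + w[1] * b[k] + w[2] * c[k] for k in range(3))
        )
    return points


# Unit square in the z = 0 plane, split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
points = sample_surface(vertices, faces, 500, random.Random(0))
```

In a tensor framework the same weighted sum would be written over batched vertex tensors, so that the sampled pointcloud stays connected to the mesh vertices in the computation graph.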
Kaolin provides complete support for many popular 3D datasets, reducing the large overhead involved in file handling, parsing, and augmentation into a single function call. (For datasets which do not possess open access licenses, the data must be downloaded independently, and its location specified to Kaolin’s dataloaders.) Access to all data is provided via extensions to the PyTorch Dataset and DataLoader classes. This makes pre-processing and loading 3D data as simple and intuitive as loading MNIST, and also directly grants users the efficient loading of batched data that PyTorch dataloaders natively support. All data is importable and exportable in Universal Scene Description (USD) format, which provides a common language for defining, packaging, assembling, and editing 3D data across graphics applications.
Datasets currently supported include ShapeNet, PartNet, SHREC [4, 41], ModelNet, ScanNet, HumanSeg, and many more common and custom collections. Through ShapeNet, for example, a huge repository of CAD models is provided, including tens of thousands of objects across dozens of classes. Through ScanNet, more than 1500 RGB-D video scans, including over 2.5 million unique depth maps, are provided, with full annotations for camera pose, surface reconstructions, and semantic segmentations. Both of these large collections of 3D information, and many more, are easily accessed through single function calls. For example, accessing ModelNet, providing it to a PyTorch dataloader, and loading a batch of voxel models takes only a few lines of code.
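The dataset pattern described here can be sketched without PyTorch or Kaolin installed. A PyTorch map-style dataset only needs `__len__` and `__getitem__`, so the hypothetical `ToyVoxelDataset` below implements just those two methods over synthetic data, and the hand-written `iterate_batches` helper stands in for `torch.utils.data.DataLoader`. None of these names come from Kaolin, and the random point clouds are a placeholder for parsed dataset files.

```python
import random


class ToyVoxelDataset:
    """Hypothetical map-style dataset: item i is (voxel grid, class label)."""

    def __init__(self, num_items, resolution, seed=0):
        rng = random.Random(seed)
        self.resolution = resolution
        # Stand-in for parsed files: random point clouds in the unit cube.
        self.clouds = [
            [(rng.random(), rng.random(), rng.random()) for _ in range(64)]
            for _ in range(num_items)
        ]
        self.labels = [i % 2 for i in range(num_items)]

    def __len__(self):
        return len(self.clouds)

    def __getitem__(self, index):
        n = self.resolution
        grid = [[[0] * n for _ in range(n)] for _ in range(n)]
        for x, y, z in self.clouds[index]:
            # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
            i, j, k = (min(int(v * n), n - 1) for v in (x, y, z))
            grid[i][j][k] = 1
        return grid, self.labels[index]


def iterate_batches(dataset, batch_size):
    """Yield (grids, labels) batches in order, like a non-shuffling loader."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        grids, labels = zip(*(dataset[i] for i in range(start, stop)))
        yield list(grids), list(labels)


dataset = ToyVoxelDataset(num_items=10, resolution=8)
batches = list(iterate_batches(dataset, batch_size=4))
```

With PyTorch available, the same class would typically subclass `torch.utils.data.Dataset`, return tensors instead of nested lists, and be wrapped in a `DataLoader` to get batching, shuffling, and parallel loading.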
2.3 3D Geometry Functions
At the core of Kaolin is an efficient suite of 3D geometric functions, which allow manipulation of 3D content. Rigid-body transformations are implemented in several parameterizations (Euler angles, Lie groups, and quaternions). Differentiable image warping layers, such as the perspective warping layers defined in GVNN (a neural network library for geometric vision), are also implemented. The geometry submodule allows for 3D rigid-body, affine, and projective transformations, as well as 3D-to-2D projection and 2D-to-3D backprojection. It currently supports orthographic and perspective (pinhole) projection.
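To make these operations concrete, here is a minimal sketch in plain Python of a quaternion-parameterized rigid-body transform followed by pinhole projection and backprojection. The function names and the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative assumptions, not Kaolin's geometry API. Every step consists of smooth arithmetic, which is why the same operations are differentiable when written over tensors.

```python
import math


def quaternion_to_matrix(w, x, y, z):
    """3x3 rotation matrix of the unit quaternion (w, x, y, z)."""
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def rigid_transform(point, rotation, translation):
    """Apply R @ point + t to a single 3D point."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def backproject(pixel, depth, fx, fy, cx, cy):
    """Inverse of project: lift a pixel with known depth back to 3D."""
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


# Rotate 90 degrees about the z axis, then move 4 units in front of the camera.
half_angle = math.pi / 4
rotation = quaternion_to_matrix(math.cos(half_angle), 0.0, 0.0, math.sin(half_angle))
camera_point = rigid_transform((1.0, 0.0, 0.0), rotation, (0.0, 0.0, 4.0))
pixel = project(camera_point, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
recovered = backproject(pixel, camera_point[2], 500.0, 500.0, 320.0, 240.0)
```

Backprojection needs the depth of the pixel as an extra input, since projection discards it; with the correct depth, `recovered` matches `camera_point` up to floating-point error.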
2.4 Modular Differentiable Renderer
Recently, differentiable rendering has emerged as an active area of research, allowing deep learning researchers to perform 3D tasks using predominantly 2D supervision [17, 21, 5]. Developing differentiable rendering tools is no easy feat, however; the operations involved are computationally heavy and complicated. With the aim of removing these roadblocks to further research in this area, and to allow for easy use of popular differentiable rendering methods, Kaolin provides a flexible and modular differentiable renderer. Kaolin defines an abstract base class, DifferentiableRenderer, containing abstract methods for each component in a rendering pipeline (geometric transformations, lighting, shading, rasterization, and projection). Assembling the components, swapping out modules, and developing new techniques using this abstract class is simple and intuitive.
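The modular design can be pictured with a small sketch. Only the class name DifferentiableRenderer and the list of pipeline stages come from the text; the method names, signatures, and the toy subclass below are assumptions made for illustration, written in plain Python. The toy rasterizer rounds to pixel indices and is therefore not itself differentiable; the sketch shows only how stages are declared abstractly and swapped by subclassing.

```python
from abc import ABC, abstractmethod


class DifferentiableRendererSketch(ABC):
    """Hypothetical skeleton with one abstract method per pipeline stage."""

    @abstractmethod
    def transform(self, vertices):
        """Geometric transformation into the camera frame."""

    @abstractmethod
    def light(self, vertices, normals):
        """Per-vertex light intensity."""

    @abstractmethod
    def shade(self, intensities, colors):
        """Combine light intensity with surface color."""

    @abstractmethod
    def project(self, vertices):
        """Map camera-frame vertices to pixel coordinates."""

    @abstractmethod
    def rasterize(self, pixels, colors):
        """Write shaded values into an image."""

    def render(self, vertices, normals, colors):
        moved = self.transform(vertices)
        shaded = self.shade(self.light(moved, normals), colors)
        return self.rasterize(self.project(moved), shaded)


class ToyPointRenderer(DifferentiableRendererSketch):
    """Toy concrete renderer that splats vertices onto a grayscale grid."""

    def __init__(self, size, light_direction):
        self.size = size
        self.light_direction = light_direction

    def transform(self, vertices):
        return vertices  # identity camera pose

    def light(self, vertices, normals):
        # Directional light: clamped cosine between normal and light direction.
        return [
            max(0.0, sum(n * l for n, l in zip(normal, self.light_direction)))
            for normal in normals
        ]

    def shade(self, intensities, colors):
        return [i * c for i, c in zip(intensities, colors)]

    def project(self, vertices):
        # Orthographic projection of [-1, 1]^2 onto integer pixel indices.
        scale = (self.size - 1) / 2.0
        return [(round((x + 1.0) * scale), round((y + 1.0) * scale))
                for x, y, _ in vertices]

    def rasterize(self, pixels, colors):
        image = [[0.0] * self.size for _ in range(self.size)]
        for (col, row), color in zip(pixels, colors):
            image[row][col] = color
        return image


renderer = ToyPointRenderer(size=5, light_direction=(0.0, 0.0, 1.0))
image = renderer.render(
    vertices=[(-1.0, -1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    normals=[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)],
    colors=[1.0, 0.5, 1.0],
)
```

Replacing a single stage, for example a perspective `project` or a soft `rasterize`, requires overriding only that method, while `render` and the other stages stay untouched.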
Kaolin supports multiple lighting (ambient, directional, specular), shading (Lambertian, Phong, Cosine), projection (perspective, orthographic, distorted), and rasterization modes. An illustration of the architecture of the abstract DifferentiableRenderer() class is shown in Fig. 3. Wherever necessary, implementations are written in CUDA, for optimal performance (cf. Table 2). To demonstrate the reduced overhead of development in this area, multiple publicly available differentiable renderers [17, 21, 5] are available as concrete instances of our DifferentiableRenderer class. One such example, DIB-Renderer, can be instantiated and used to differentiably render a mesh to an image using Kaolin in only a few lines of code.
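As a reminder of what two of the named shading models compute, the following is a minimal sketch of Lambertian and Phong shading for a single surface point, in plain Python. These are the textbook formulas rather than Kaolin's implementation, and the function names and default coefficients (`ambient`, `diffuse`, `specular`, `shininess`) are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def lambertian(normal, to_light):
    """Diffuse intensity: clamped cosine between the normal and the light."""
    return max(0.0, dot(normalize(normal), normalize(to_light)))


def phong(normal, to_light, to_viewer,
          ambient=0.1, diffuse=0.7, specular=0.2, shininess=16):
    """Phong reflection: ambient + diffuse + specular highlight."""
    n, l, v = normalize(normal), normalize(to_light), normalize(to_viewer)
    n_dot_l = dot(n, l)
    if n_dot_l <= 0.0:
        return ambient  # the surface faces away from the light
    # Mirror the light direction about the normal.
    reflected = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    highlight = max(0.0, dot(reflected, v)) ** shininess
    return ambient + diffuse * n_dot_l + specular * highlight


head_on = phong((0, 0, 1), (0, 0, 1), (0, 0, 1))
at_45_degrees = lambertian((0, 0, 1), (1, 0, 1))
back_facing = phong((0, 0, 1), (0, 0, -1), (0, 0, 1))
```

With the light and viewer both along the normal, all three Phong terms reach their maximum and the intensity is the sum of the coefficients; a back-facing surface receives only the ambient term.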
2.5 Loss Functions and Metrics
A common challenge for 3D deep learning applications lies in defining and implementing tools for evaluating performance and for supervising neural networks. For example, comparing surface representations such as meshes or point clouds might require matching positions of thousands of points or triangles, which makes CUDA implementations a necessity [9, 38, 32]. As a result, Kaolin provides implementations for an array of commonly used 3D metrics for each 3D representation. Included in this collection of metrics are intersection over union for voxels, Chamfer distance and (a quadratic approximation of) Earth-mover’s distance for pointclouds, and the point-to-surface loss for meshes, along with many other mesh metrics such as the Laplacian, smoothness, and edge-length regularizers [38, 17].
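Two of these metrics are simple enough to state in a few lines. The sketch below gives brute-force reference versions of voxel intersection over union and Chamfer distance in plain Python; the function names are hypothetical and this is not Kaolin's CUDA code. Chamfer distance conventions differ between papers (squared or unsquared distances, sums or means), and the version here assumes mean squared nearest-neighbour distances added over both directions.

```python
def voxel_iou(pred, target):
    """Intersection over union of two flattened binary occupancy grids."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty grids are treated as a perfect match.
    return intersection / union if union else 1.0


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance with mean squared nearest-neighbour terms."""
    a_to_b = sum(
        min(squared_distance(p, q) for q in cloud_b) for p in cloud_a
    ) / len(cloud_a)
    b_to_a = sum(
        min(squared_distance(q, p) for p in cloud_a) for q in cloud_b
    ) / len(cloud_b)
    return a_to_b + b_to_a


iou = voxel_iou([1, 1, 0, 0], [1, 0, 1, 0])
cd = chamfer_distance(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
)
```

The nested loops cost time proportional to the product of the two cloud sizes, which illustrates why clouds with thousands of points call for a GPU implementation.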
Researchers new to the field of 3D deep learning are faced with a storm of questions over the choice of 3D representations, model architectures, loss functions, etc. We ameliorate this by providing a rich collection of baselines, as well as state-of-the-art architectures for a variety of 3D tasks, including, but not limited to, classification, segmentation, 3D reconstruction from images, super-resolution, and differentiable rendering. In addition to source code, we also release pre-trained models for these tasks on popular benchmarks, to serve as baselines for future research. We also hope that this will help encourage standardization in a field where evaluation methodology and criteria are still nascent.
Methods found in this model zoo currently include Pixel2Mesh, GEOMetrics, and AtlasNet for reconstructing mesh objects from single images; NM3DR, Soft-Rasterizer, and DIB-Renderer for the same task with only 2D supervision; MeshCNN for generic learning over meshes; PointNet and PointNet++ for generic learning over point clouds; 3D-GAN, 3D-IWGAN, and 3D-R2N2 for learning over distributions of voxels; and Occupancy Networks and DeepSDF for learning over level sets and SDFs, among many more. Figure 4 highlights an array of results from these methods and their pre-trained models, all directly accessible through Kaolin’s model zoo.
An undeniably important aspect of any computer vision task is visualizing data. For 3D data, however, this is not at all trivial. While Python packages exist for visualizing some data types, such as voxels and point clouds, no package supports visualization across all popular 3D representations. One of Kaolin’s key features is visualization support for all of its representation types. This is implemented via lightweight visualization libraries such as Trimesh and pptk for run-time visualization. As all data is exportable to USD, 3D results can also easily be visualized in more intensive graphics applications with far higher fidelity (see Figure 4 for example renderings). For headless applications, such as when running on a server that has no attached display, we provide compact utilities to render images and animations to disk, for visualization at a later point.
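The headless case can be illustrated with a very small stand-in. The sketch below is plain Python and does not use Kaolin's utilities, Trimesh, or pptk: the hypothetical `splat_to_pgm` function projects a point cloud orthographically along the z axis into an ASCII PGM image and writes it to a temporary file, so no display is required. The file name and the brightness encoding are arbitrary choices made for the example.

```python
import os
import tempfile


def splat_to_pgm(points, size):
    """Orthographic splat of points in [0, 1]^3 to an ASCII PGM string.

    Pixel brightness (0-255) encodes z; the brightest point wins per pixel.
    """
    image = [[0] * size for _ in range(size)]
    for x, y, z in points:
        # Clamp so that a coordinate of exactly 1.0 lands in the last pixel.
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        image[row][col] = max(image[row][col], int(round(z * 255)))
    rows = [" ".join(str(value) for value in row) for row in image]
    return "P2\n{0} {0}\n255\n".format(size) + "\n".join(rows) + "\n"


points = [(0.1, 0.1, 1.0), (0.9, 0.9, 0.4), (0.1, 0.1, 0.2)]
pgm_text = splat_to_pgm(points, size=4)

# Write the image to disk so it can be inspected later on another machine.
output_path = os.path.join(tempfile.mkdtemp(), "cloud.pgm")
with open(output_path, "w") as handle:
    handle.write(pgm_text)
```

The resulting file opens in most image viewers; writing one such file per frame is a simple way to build an animation without an attached display.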
While we view Kaolin as a major step in accelerating 3D DL research, the efforts do not stop here. We intend to foster a strong open-source community around Kaolin, and welcome contributions from other 3D deep learning researchers and practitioners. In this section, we present a general roadmap of Kaolin as open-source software.
Model Zoo: We seek to continually improve our model zoo, especially given that Kaolin provides extensive functionality that reduces the time required to implement new methods (most approaches can be implemented in a day or two of work).
Differentiable rendering: We plan to extend support to newer differentiable rendering tools, and to include functionality for additional tasks such as domain randomization, material recovery, and the like.
3D object detection: Currently, Kaolin does not have models for 3D object detection in its model zoo. This is a thrust area for future releases.
Automatic Mixed Precision: To make 3D neural network architectures more compact and faster, we are investigating the applicability of Automatic Mixed Precision (AMP) to commonly used 3D architectures (PointNet, MeshCNN, Voxel U-Net, etc.). NVIDIA Apex supports most AMP modes for popular 2D deep learning architectures, and we would like to investigate extending this support to 3D.
Secondary light effects: Kaolin currently supports only primary lighting effects in its differentiable rendering class, which limits the application’s ability to reason about more complex scene information such as shadows. Future releases are planned to support path-tracing and ray-tracing, so that these secondary effects fall within the scope of the package.
We look forward to the 3D community trying out Kaolin, giving us feedback, and contributing to its development.
The authors would like to thank Amlan Kar for suggesting the need for this library. We also thank Ankur Handa for his advice during the initial and final stages of the project. Many thanks to Johan Philion, Daiqing Li, Mark Brophy, Jun Gao, and Huan Ling who performed detailed internal reviews, and provided constructive comments. We also thank Gavriel State for all his help during the project.
- [1] (2016) Applying deep learning in augmented reality tracking. 2016 12th International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), pp. 47–54.
- [2] (2017) Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints.
- [3] (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
- [4] (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012.
- [5] (2019) Learning to predict 3d objects with an interpolation-based differentiable renderer. NeurIPS.
- [6] (2016) Monocular 3d object detection for autonomous driving. In CVPR.
- [7] (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In ECCV.
- [8] (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In CVPR.
- [9] (2017) A point set generation network for 3d object reconstruction from a single image. In CVPR.
- [10] (2016) Real-time high resolution 3d data on the hololens. In 2016 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct).
- [11] (2018) AtlasNet: a papier-mâché approach to learning 3d surface generation. CVPR.
- [12] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
- [13] (2016) Gvnn: neural network library for geometric computer vision. In ECCV Workshop on Geometry Meets Deep Learning.
- [14] (2019) MeshCNN: a network with an edge. ACM Transactions on Graphics (TOG) 38 (4), pp. 90:1–90:12.
- [15] (2019) (Website) University of California San Diego.
- [16] (2012) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231.
- [17] (2018) Neural 3d mesh renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [18] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
- [19] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- [20] (2018) Differentiable monte-carlo ray tracing through edge sampling. In SIGGRAPH Asia 2018 Technical Papers, pp. 222.
- [21] (2019) Soft rasterizer: a differentiable renderer for image-based 3d reasoning. ICCV.
- [22] (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR.
- [23] (2018) Visual slam for automated driving: exploring the applications of deep learning. In CVPR Workshops.
- [24] (2019) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In CVPR.
- [25] (2017) 3d bounding box estimation using deep learning and geometry. In CVPR.
- [26] (2006) Natural terrain classification using three-dimensional ladar data for ground robot mobility. J. Field Robotics 23, pp. 839–861.
- [27] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR.
- [28] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- [29] (2017) PointNet: deep learning on point sets for 3d classification and segmentation. CVPR.
- [30] (2019) Kornia: an open source differentiable computer vision library for pytorch. In Winter Conference on Applications of Computer Vision.
- [31] (2019) Orthographic feature transform for monocular 3d object detection. British Machine Vision Conference (BMVC).
- [32] (2019) GEOMetrics: exploiting geometric structure for graph-encoded objects. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5866–5876.
- [33] (2017) Improved adversarial systems for 3d object generation and reconstruction. In Conference on Robot Learning.
- [34] (2018) Robobarista: object part based transfer of manipulation trajectories from crowd-sourcing in 3d pointclouds. In Robotics Research, pp. 701–720.
- [35] TurboSquid: 3d models for professionals. https://www.turbosquid.com/
- [36] Universal scene description. https://github.com/PixarAnimationStudios/USD
- [37] (2019) TensorFlow graphics: computer graphics meets deep learning.
- [38] (2018) Pixel2Mesh: generating 3d mesh models from single rgb images. In ECCV.
- [39] (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG).
- [40] (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems.
- [41] (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR.
- [42] (2016) Real-time 3d scene layout from a single image using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA).
- [43] (2013) A vision-based robotic grasping system using deep learning for 3d object recognition and pose estimation. 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1175–1180.