Abstract
We introduce Scan2Mesh, a novel datadriven generative approach whichtransforms an unstructured and potentially incomplete range scan into astructured 3D mesh representation. The main contribution of this work is agenerative neural network architecture whose input is a range scan of a 3Dobject and whose output is an indexed face set conditioned on the input scan.In order to generate a 3D mesh as a set of vertices and face indices, thegenerative model builds on a series of proxy losses for vertices, edges, andfaces. At each stage, we realize a onetoone discrete mapping between thepredicted and ground truth data points with a combination of convolutional andgraph neural network architectures. This enables our algorithm to predict acompact mesh representation similar to those created through manual artisteffort using 3D modeling software. Our generated mesh results thus producesharper, cleaner meshes with a fundamentally different structure from thosegenerated through implicit functions, a first step in bridging the gap towardsartistcreated CAD models.
Quick Read (beta)
Scan2Mesh: From Unstructured Range Scans to 3D Meshes
Abstract
We introduce Scan2Mesh, a novel datadriven generative approach which transforms an unstructured and potentially incomplete range scan into a structured 3D mesh representation. The main contribution of this work is a generative neural network architecture whose input is a range scan of a 3D object and whose output is an indexed face set conditioned on the input scan. In order to generate a 3D mesh as a set of vertices and face indices, the generative model builds on a series of proxy losses for vertices, edges, and faces. At each stage, we realize a onetoone discrete mapping between the predicted and ground truth data points with a combination of convolutional and graph neural network architectures. This enables our algorithm to predict a compact mesh representation similar to those created through manual artist effort using 3D modeling software. Our generated mesh results thus produce sharper, cleaner meshes with a fundamentally different structure from those generated through implicit functions, a first step in bridging the gap towards artistcreated CAD models.
1 Introduction
3D meshes are one of the most popular representations used to create and design 3D surfaces, from across content creation for movies and video games to architectural and mechanical design modeling. These mesh or CAD models are handcrafted by artists, often inspired by or mimicking realworld objects and scenes through expensive, significantly tedious manual effort. Our aim is to develop a generative model for such 3D mesh representations; that is, a mesh model described as an indexed face set: a set of vertices as 3D positions, and a set of faces which index into the vertices to describe the 3D surface of the model. In this way, we can begin to learn to generate 3D models similar to the handcrafted content creation process.
The nature of these 3D meshes, structured but irregular (e.g., irregular vertex locations, varying face sizes), make them very difficult to generate. In particular, with the burgeoning direction of generative deep learning approaches for 3D model creation and completion [5, 10, 30, 6], the irregularity of mesh structures provides a significant challenge, as these approaches are largely designed for regular grids. Thus, work in the direction of generating 3D models predominantly relies on the use of implicit functions stored in regular volumetric grids, for instance the popular truncated signed distance field representation [3]. Here, a mesh representation can be extracted at the isosurface of the implicit function through Marching Cubes [21]; however, this uniformlysampled, unwieldy triangle soup output remains fundamentally different from 3D meshes in video games or other artistcreated mesh/CAD content.
Rather than generate 3D mesh models extracted from regular volumetric grids, we instead take inspiration from 3D models that have been handmodeled, that is, compact CADlike mesh representations. Thus, we propose a novel approach, Scan2Mesh, which constructs a generative formulation for producing a mesh as a lightweight indexed face set, and demonstrate our approach to generate complete 3D mesh models conditioned on noisy, partial range scans. Our approach is the first, to the best of our knowledge, to leverage deep learning to fully generate an explicit mesh structure. From an input partial scan, we employ a graph neural network based approach to jointly predict mesh vertex positions as well as edge connectivity; this joint optimization enables reliable vertex generation for a final mesh structure. From these vertex and edge predictions, interpreting them as a graph, we construct the corresponding dual graph, with potentially valid mesh faces as dual graph vertices, from which we then predict mesh faces. These tightly coupled predictions of mesh vertices along with edge and face structure enable effective transformation of incomplete, noisy object scans to complete, compact 3D mesh models. Our generated meshes are cleaner and sharper, while maintaining fundamentally different structure from those generated through implicit functions; we believe this is a first step to bridging the gap towards artistcreated CAD models.
To sum up, our contributions are as follows:

•
A graph neural network formulation to generate meshes directly as indexed face sets.

•
Demonstration of our generative model to the task of shape completion, where we achieve significantly cleaner and more CADlike results than stateoftheart approaches.
2 Related Work
Recent advances in machine learning, coupled with the increasing availability of 3D shape and scene databases [2, 30, 4], has spurred development of deep learning approaches on 3D data. 3D ShapeNets [34] and VoxNet [22] were among the first approaches to propose 3D convolutional neural networks, both leveraging occupancybased representations encoded in regular volumetric grids in order to perform object recognition tasks. Various other approaches have since been developed upon 3D CNNbased architectures, targeting applications such as object classification [24], object detection [29], 3D keypoint matching [35], and scan completion [5, 10, 30, 6].
Such approaches have largely been developed upon regular volumetric grid representations, a natural 3D analogue to image pixels. Earlier 3D CNN approaches leveraged occupancybased volumetric representations [34, 22, 24], simply encoding whether each voxel is occupied, empty (or optionally unknown). Inspiration has also been taken from work in 3D scanning and reconstruction, where implicit volumetric representations, in particular the truncated signed distance field representation, are very popular. Such representations encode both finergrained information about the surface as well as the empty space, and have recently been effectively leveraged for both discriminative and generative tasks [5, 30, 6]. Hierarchical strategies have also been proposed to alleviate the cubic cost of these dense volumetric representations [26, 33], and have been shown to generate higherresolution 3D output grids [25, 31, 11, 10]. However, the 3D surfaces extracted from these regular volumetric grids maintain fundamentally different structure from that of handcrafted CAD models.
Pointbased representations have recently been popularized with the introduction of the PointNet architecture [23], which demonstrated 3D classification and segmentation on a more efficient 3D representation than dense volumetric grids. Generative approaches have also been developed upon point cloud representations [7], but 3D point cloud outputs lack the structured surface information of meshes.
Several approaches for inferring the mesh structure of an object from an input image have recently been introduced, leveraging very strong priors on possible mesh structure in order to create the output meshes. AtlasNet [9] learns to generate a 2D atlas embedding of the 3D mesh of an object. Another approach is to learn to deform template meshes (e.g., an ellipsoid) to create an output 3D mesh model of an object [19, 32, 12]. Such approaches generate 3D mesh surfaces as output, but are constrained to a limited set of mesh structures, whereas we aim to generate the explicit mesh structure from scratch.
In contrast to previous approaches, we take inspiration from handcrafted CAD models and develop an approach to generate the full mesh graph structure, from vertices to edges to faces. To this end, we leverage developments in machine learning approaches on graphs, in particular graph neural networks [28], to formulate an method to generate 3D mesh vertices, edges, and faces.
3 Method Overview
Our method generates a 3D mesh as a set of vertices (3D positions) and a set of $k$sided faces which index into the vertices to describe the mesh surface, conditional on an input partial range scan. Note that our approach is agnostic to the input data and representation as our focus lies in the formulation of a generative approach to explicitly generate mesh structure; in this paper, we use the task of shape completion to exemplify our approach. For shape completion, we aim to generate a complete mesh model from an input partial scan of an object. Here, the input scan is captured as depth image or set of depth images, which are then fused into a ${32}^{3}$ volumetric grid as a truncated signed distance field through volumetric fusion [3]. Training is performed with supervised inputtarget pairs, with input scans generated by virtually scanning objects from the ShapeNet dataset [2].
We propose a new graph neural network in order to predict the vertices, edges, and then faces of the mesh graph structure. First, features from the input TSDF scan are computed through a series of 3D convolutions; from this feature space, we predict a set of 3D vertex locations. These vertex locations form the nodes of the mesh graph. We construct a fully connected graph on these mesh vertices, and employ graph neural network to predict which mesh edges belong to the mesh graph structure. Note that the vertices and edges are predicted jointly in order to learn reliable vertex generation for a final mesh structure.
From the graph of intermediate predicted vertices and edges, we then construct the dual graph in order to predict the final face structure of the mesh. The nodes of the dual graph characterize potential faces (i.e., each node represents a potential face, which is a set of $k$ predicted edges that connect to form a valid $k$sided face), and we employ another graph neural network to predict the final mesh faces. We maintain losses on the vertices, edges, and faces during this mesh generation process in order to learn to generate CADlike mesh models.
4 Scan2Mesh Network Architecture
Our Scan2Mesh network architecture is visualized in Figure 2. It is composed of two main components: first, a 3Dconvolutional and graph neural network architecture to jointly predict vertex locations and edge connectivity; and second, a graph neural network to predict the final mesh face structure. For the task of shape completion, we take as input a range scan represented as a truncated signed distance field (TSDF) in a ${32}^{3}$ volumetric grid. We represent the TSDF as a $5$channel volumetric grid, in which the first two channels store the truncated unsigned distance field values and known/unknown space according to the camera trajectory of the scan, and the last three channels store the $(x,y,z)$ coordinates of the volumetric grid in the coordinate system of the mesh vertex positions, so that the TSDF volume is spatially “aligned” with the mesh – in the same spirit as the CoordConv operator proposed by [20]. The TSDF data generation of the partiallyscanned input is detailed in Sec. 5.
4.1 Joint Vertex and Edge Prediction
The TSDF input then is used to predict a set of $n$ mesh vertex locations through a series of 3D convolutions (kernel sizes $4,3,3,3$, all but the last followed by a $1\times 1\times 1$ convolution). The resulting feature space, $f(\text{TSDF})$ is used to predict an $n\times 3$ tensor of $n$ vertex position through a series of two fullyconnected layers. We also denote the intermediary feature space after two sets of 3D convolutions as ${f}_{2}(\text{TSDF})$, which is used to capture spatial features of the input scan to inform the edge prediction.
We then construct a fullyconnected graph with $n$ nodes corresponding to the $n$ vertex positions. The initial node features are characterized by the $3$dimensional vertex positions, in addition to the closest feature vector in ${f}_{2}(\text{TSDF})$ by looking up the vertex positions into the ${f}_{2}(\text{TSDF})$ grid. We propose a graph neural network on this graph, which remains agnostic to the vertex ordering. For a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ comprising vertices $v\in \mathcal{V}$ and edges $e=(v,{v}^{\prime})\in \mathcal{E}$, messages are passed from nodes to edges, and edges to nodes as follows, similar to [8, 17]:
$$v\to e:{\mathbf{h}}_{i,j}^{\prime}={f}_{e}([{\mathbf{h}}_{i},{\mathbf{h}}_{j}])$$ 
$$e\to v:{\mathbf{h}}_{i}^{\prime}={f}_{v}(\sum _{\{{e}_{i,j}\}}{\mathbf{h}}_{i,j})$$ 
where ${\mathbf{h}}_{i}$ represents the feature of vertex ${v}_{i}$, ${\mathbf{h}}_{i,j}$ represents the feature of edge ${e}_{i,j}$, and $[\cdot ,\cdot ]$ denotes concatenation. Thus, an edge ${e}_{i,j}$ receives updates through the concatenated features of the vertices ${v}_{i},{v}_{j}$ it is defined by, and a vertex ${v}_{i}$ receives updates through the sum of the features of the edges ${e}_{i,j}$ incident on ${v}_{i}$. ${f}_{v}$ and ${f}_{e}$ are MLPs operating on nodes and edges, respectively. For full network architecture details regarding layer definitions, please see the appendix.
The vertices are optimized for with an ${\mathrm{\ell}}_{1}$ loss, where we first compute a onetoone mapping between the predicted vertices and ground truth vertices using the Hungarian algorithm [18]. This onetoone mapping during training is essential for predicting reliable mesh structure; a greedy approach (e.g., Chamfer loss on vertices) results in collapse of smaller structures as shown in Figure. 3.
The output predicted vertices along with the input scan features ${f}_{2}(\text{TSDF})$ are then used to predict edge connectivity on the graph of the mesh with vertices as nodes. Each node is initially associated with two features, the $3$dimensional vertex positions and the closest feature vector in ${f}_{2}(\text{TSDF})$ according to the respective vertex positions. These features are processed independently through small MLPs, then concatenated to the form vertex features which are then processed through graph message passing. For each edge in the fullyconnected graph, we predict whether it belongs to the mesh graph structure using a (twodimensional) cross entropy loss. The vertex positions and edges are optimized for jointly in order to reliably predict vertices belonging to a mesh structure.
4.2 Mesh Face Prediction
We predict the final mesh faces from these intermediate predicted vertices and edges by transforming the graph of predicted mesh vertices and edges into its dual graph. This dual graph comprises the set of valid potential faces as the nodes of the graph, with a (dual graph) edge between two nodes if the two potential faces share an edge. The nodes are represented by an $8$dimensional feature vector comprising the centroid, normal, surface area, and radius of its respective potential face. We then employ a graph neural network formulated similarly as that for the vertex and edge prediction, this time predicting which faces belong to the final mesh structure. Note that final mesh face predictions need not contain all intermediary predicted edges. We first train the face prediction using a cross entropy loss on the nodes of the dual graph, and then use a chamfer loss between points sampled from the predicted mesh and the target mesh in order to better encourage all structurally important faces to be predicted.
Method  Mesh Distance  Mesh Normal Similarity 

Poisson Surface Reconstruction [13, 14]  0.0136  0.61 
ShapeRecon [27]  0.0075  0.61 
3D ShapeNets [34]  0.0027  0.60 
3DEPN [5]  0.0023  0.70 
Ours  0.0016  0.90 
Input  Average  Chairs  Tables  Airplanes  Dressers  Lamps  Boats  Sofas  Cars  

Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  
Points  0.0019  0.87  0.0016  0.88  0.0022  0.88  0.0008  0.94  0.0015  0.87  0.0045  0.82  0.0013  0.86  0.0021  0.86  0.0009  0.83 
TSDF  0.0016  0.90  0.0015  0.90  0.0021  0.89  0.0010  0.93  0.0014  0.88  0.0029  0.86  0.0011  0.88  0.0016  0.91  0.0008  0.91 
4.3 Training
To train our model, we use the training data generated from the ShapeNet dataset [2] as described in Sec. 5.
We use the ADAM optimizer [15] with a learning rate of $0.0005$ and batch size of 8 for all training. We train on eight classes of the ShapeNet dataset, following the train/test split of [5]. We additionally follow their training data augmentation, augmenting each train object by generating two virtual scanning trajectories for each object, resulting in $48,166$ train samples and $10,074$ validation samples.
We train the vertexedge prediction for $5$ epochs ($\approx 15$ hours). While we found it sufficient to train the joint vertexedge prediction through finding a onetoone mapping between the predicted vertices and ground truth mesh vertices (the edges following as vertex indices), we found that for training face prediction with cross entropy loss, the onetoone mapping sometimes resulted in distorted target faces, and it was more reliable to train the model on dual graphs computed from the ground truth meshes. Thus we first train the face prediction network for $1$ epoch ($\approx 6$ hours) using a cross entropy loss and ground truth dual graph data, and then train on dual graphs from predicted vertices and edges using a chamfer loss between the predicted and target meshes (for $1$ epoch, $\approx 18$ hours).
Average  Chairs  Tables  Airplanes  Dressers  Lamps  Boats  Sofas  Cars  

Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  
Greedy  0.0022  0.73  0.0020  0.73  0.0027  0.76  0.0012  0.82  0.0020  0.71  0.0047  0.72  0.0016  0.713  0.0021  0.72  0.0011  0.71 
1to1  0.0016  0.90  0.0015  0.90  0.0021  0.89  0.0010  0.93  0.0014  0.88  0.0029  0.86  0.0011  0.88  0.0016  0.91  0.0008  0.91 
5 Data Generation
For training data generation, we use the ShapeNet model database [2], and train on a subset of 8 classes (see Sec. 6). We follow the training data generation process of [5], generating training inputtarget pairs by virtually scanning the ShapeNet objects along the camera trajectories given by their ShapeNet virtual scans dataset. We use two trajectories for each object for training. The virtually captured depth map(s) along these trajectories are then fused into a ${32}^{3}$ grid through volumetric fusion [3] to obtain input TSDFs. We use a truncation of $3$ voxels for all experiments. An object is mapped from its world space into a ${32}^{3}$ grid by scaling the largest bounding box extent to $323*2$ (for $3$ voxels of padding on each side).
For ground truth meshes, we use triangle meshes simplified from ShapeNet models. In order to both reduce the complexity of the graph sizes as well as unify some of the irregularity of the ShapeNet meshes, we simplify all target meshes to $100$ vertices each using the VHCAD library [1], which approximately maintains the convex hull of the original mesh.
6 Results and Evaluation
Average  Chairs  Tables  Airplanes  Dressers  Lamps  Boats  Sofas  Cars  

Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  Dist  NSim  
Direct(GT)  0.0042  0.74  0.0035  0.70  0.0042  0.75  0.0078  0.71  0.0030  0.76  0.0053  0.77  0.0022  0.77  0.0058  0.67  0.0014  0.78 
Direct(Surf)  0.0031  0.78  0.0031  0.73  0.0028  0.79  0.0025  0.85  0.0028  0.74  0.0077  0.77  0.0016  0.82  0.0033  0.72  0.0010  0.86 
Dual Pred  0.0016  0.90  0.0015  0.90  0.0021  0.89  0.0010  0.93  0.0014  0.88  0.0029  0.86  0.0011  0.88  0.0016  0.91  0.0008  0.91 
In this section, we provide an evaluation of our proposed method with a comparison to existing approaches on the task of scan completion of 3D shapes. We evaluate on the ShapeNet [2] dataset, using the train/test split provided by 3DEPN [5] comprising 8 classes: chairs, tables, sofas, dressers, lamps, boats, cars, and airplanes. We test on the $1200$ object test set proposed by 3DEPN of single depth image range scans ($150$ objects per class). We compare our mesh results to meshes produced by stateoftheart approaches; in the case that an approach generates an implicit function, we extract an output mesh using Matlab’s isosurface function. To measure the mesh quality, we employ two metrics: (1) we measure the mesh completeness using a chamfer distance between uniformly sampled points from the predicted mesh and the ground truth mesh, and (2) we measure the normal deviation from the ground truth mesh to characterize mesh sharpness and cleanness. The normal deviation metric is computed bidirectionally: for meshes ${M}_{a},{M}_{b}$ we compute the normal deviation $N({M}_{b},{M}_{a})$ from ${M}_{b}$ to ${M}_{a}$ as the average of the cosine of the normal angle difference for the closest face in ${M}_{b}$ to each face ${M}_{a}$, and take $0.5(N({M}_{a},{M}_{b})+N({M}_{b},{M}_{a}))$ as the global normal deviation (the closest face is found by the distance between the faces weighted by $(1\mathrm{cos}({n}_{i},{n}_{j}))$ when the bounding boxes of the faces intersect, and the distance between the face bounding boxes otherwise, to disambiguate nearest faces in cases where faces form corners).
Comparison to state of the art. We evaluate against stateoftheart shape completion approaches in Table 1 and Figure 4. Additionally, we evaluate various design choices in Tables 2, 3, and 4. Here, we see that our approach generates sharper, cleaner meshes than previous volumetricbased approaches while producing accurate completeness in global shape structure.
What is the impact of the input scan representation? We evaluate our approach using a point cloud representation of the input range scan (uniformly sampled from the range image inputs) in comparison to a TSDF in Table 2. Both representations produce good mesh results, but we find that regularity and encoding of known and unknown space in the TSDF representation produces better completion and mesh quality.
Do we need a 1to1 mapping between prediction and target during training? In Table 3, we evaluate using a greedy mapping between predicted vertices and target vertices during vertexedge training. Using a greedy mapping degrades the quality of vertex predictions with respect to the final mesh structure (e.g., we want a cluster of vertices at the end of a chair leg instead of one vertex), and results in worse mesh reconstruction quality (see Figure 3 for an example visualization).
Why use the dual graph for face prediction? We evaluate our face prediction approach, which leverages the dual graph of the mesh vertexedge graph, in Table 4. Here, we compare against directly predicting mesh faces using the same formulation as the joint vertexedge prediction, where instead of predicting edges as which two vertices are connected, we predict faces as which sets of three vertices are connected, resulting in $\mathcal{O}({n}^{3})$ possible faces from which the mesh faces must be predicted (we refer to the appendix for more detail regarding directly predicting faces). Given the large combinatorics here, where the number of ground truth mesh faces is approximately $0.2\%$ of the number of total possible faces (for $n=100$), we evaluate two possibilities for training the direct face prediction: Direct(GT) uses only the ground truth mesh faces as target faces, and Direct(Surf) which uses all possible faces close to the ground truth mesh surface as target faces. Both approaches nonetheless suffer from the heavy combinatorics, whereas our approach of predicting faces by using the dual graph of the mesh vertexedge graph produces significantly better mesh structure and completeness.
6.1 Limitations
We propose one of the first approaches to explicitly generate a mesh as an indexed face set, and believe that this is a stepping stone towards future work in constructing CADlike 3D models akin to those currently handcrafted. For instance, our use of fullyconnected graphs limits the size of our models; adapting the graphs and message passing to enable learning on significantly larger mesh graphs would open up generation of higher resolution or larger scale models. Additionally, we do not explicitly enforce mesh regularity or surface continuity (which are also not given in the ShapeNet models); adding hard constraints into the optimization to guarantee these attributes would open up many more applications for these models.
7 Conclusion
We presented Scan2Mesh, a generative model for creating 3D mesh models as indexed face sets, inspired by 3D model representations used in handcrafted 3D models. We proposed a new graph neural network formulation to generate a mesh representation directly, and demonstrated our mesh generation on the task of shape completion, achieving cleaner and more CADlike mesh models from noisy, partial range scans. We believe that this opens up myriad possibilities towards bridging the gap of 3D model generation towards the quality of artistcreated CAD models.
Acknowledgments
This work was supported by a Google Research Grant, a TUM Foundation Fellowship, a TUMIAS Rudolf Mößbauer Fellowship, and the ERC Starting Grant Scan2CAD.
References
 [1] VHCAD. https://github.com/kmammou/vhacd.
 [2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An informationrich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
 [3] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 303–312. ACM, 1996.
 [4] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richlyannotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
 [5] A. Dai, C. R. Qi, and M. Nießner. Shape completion using 3dencoderpredictor cnns and shape synthesis. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
 [6] A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner. Scancomplete: Largescale scene completion and semantic segmentation for 3d scans. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [7] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.
 [8] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
 [9] T. Groueix, M. Fisher, V. G. Kim, B. Russell, and M. Aubry. AtlasNet: A PapierMâché Approach to Learning 3D Surface Generation. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
 [10] X. Han, Z. Li, H. Huang, E. Kalogerakis, and Y. Yu. High Resolution Shape Completion Using Deep Neural Networks for Global Structure and Local Geometry Inference. In IEEE International Conference on Computer Vision (ICCV), 2017.
 [11] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. arXiv preprint arXiv:1704.00710, 2017.
 [12] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning categoryspecific mesh reconstruction from image collections. In ECCV, 2018.
 [13] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, volume 7, 2006.
 [14] M. Kazhdan and H. Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (TOG), 32(3):29, 2013.
 [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [16] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [17] T. Kipf, E. Fetaya, K.C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.
 [18] H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(12):83–97, 1955.
 [19] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. arXiv preprint arXiv:1712.00268, 2017.
 [20] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. arXiv preprint arXiv:1807.03247, 2018.
 [21] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In ACM siggraph computer graphics, volume 21, pages 163–169. ACM, 1987.
 [22] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for realtime object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
 [23] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
 [24] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas. Volumetric and multiview cnns for object classification on 3d data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.
 [25] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger. Octnetfusion: Learning depth fusion from data. arXiv preprint arXiv:1704.01047, 2017.
 [26] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [27] J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3d object shape from one depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2484–2493, 2015.
 [28] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 [29] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgbd images. arXiv preprint arXiv:1511.02300, 2015.
 [30] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [31] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for highresolution 3d outputs. arXiv preprint arXiv:1703.09438, 2017.
 [32] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018.
 [33] P.S. Wang, Y. Liu, Y.X. Guo, C.Y. Sun, and X. Tong. Ocnn: Octreebased convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
 [34] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 [35] A. Zeng, S. Song, M. Nießner, M. Fisher, and J. Xiao. 3dmatch: Learning the matching of local 3d geometry in range scans. arXiv preprint arXiv:1603.08182, 2016.
Appendix A Appendix
A.1 Network Architecture Details
Figure 6 shows the detailed specification of our network architecture. Convolutions are specified by (kernel size, stride, padding), and are each followed by ReLUs. For both graph neural networks, each fully connected layer (except the last) is followed by an ELU, and within each pair of fully connected layers, we use a dropout of $0.5$ during training and batch normalization following each pair. The nodetoedge and edgetonode message passing operations are as defined in the main paper, through concatenation and summation, respectively.
A.2 Learned Feature Space
In Figure 7, we visualize the tSNE of the latent vectors learned from our Scan2Mesh model trained for shape completion. We extract the latent vectors of a set of test input partial scans (the $256$dimensional vector encoding) and use tSNE to visualize the feature space as a 2dimensional embedding, with images of the partial scans displayed accordingly. Our model learns to cluster together shapes of similar geometric structure.
A.3 Shape Generation
In the main paper, we demonstrate our mesh generation approach on the task of scan completion of shapes; we can also apply it to other tasks such as shape generation. Here, instead of learning an encoder for TSDF scans of shapes, we train a variational autoencoder [16] to produce mesh vertices and edges (on the same 8class training set of ShapeNet [2] objects). Figure 8 shows shapes generated by drawing random samples from a unit normal distribution, along with nearest neighbor ground truth objects.
A.4 Direct Mesh Face Prediction Details
Here, we further describe the details of the approach to directly predict mesh faces from a single graph neural network, as presented in the ablation study of the main results section. This graph network structure has the (predicted) mesh vertices as its nodes, with message passing then operating on every set of 3 nodes (assuming a triangle mesh structure). Messages are then passed from node to face through concatenation, and from face to node through summation, similar to the nodeedge message passing:
$$v\to f:{\mathbf{h}}_{i,j,k}^{\prime}={g}_{f}([{\mathbf{h}}_{i},{\mathbf{h}}_{j},{\mathbf{h}}_{k}])$$ 
$$f\to v:{\mathbf{h}}_{i}^{\prime}={g}_{v}(\sum _{\{{f}_{i,j,k}\}}{\mathbf{h}}_{i,j,k})$$ 
where an updated face feature is the concatenation of the node features from which it is composed, and an updated node feature is the sum of features of all faces incident on that node. Here, even for triangle meshes, the combinatorics grows tremendously with $\mathcal{O}({n}^{3})$, making the optimization for face structure challenging.