Abstract
3D models are commonly used in computer vision and graphics. With the wideravailability of mesh data, an efficient and intrinsic deep learning approach toprocessing 3D meshes is in great need. Unlike images, 3D meshes have irregularconnectivity, requiring careful design to capture relations in the data. Toutilize the topology information while staying robust under differenttriangulation, we propose to encode mesh connectivity using Laplacian spectralanalysis, along with Mesh Pooling Blocks (MPBs) that can split the surfacedomain into local pooling patches and aggregate global information among them.We build a mesh hierarchy from fine to coarse using Laplacian spectralclustering, which is flexible under isometric transformation. Inside the MPBsthere are pooling layers to collect local information and multilayerperceptrons to compute vertex features with increasing complexity. To obtainthe relationships among different clusters, we introduce a Correlation Net tocompute a correlation matrix, which can aggregate the features globally bymatrix multiplication with cluster features. Our network architecture isflexible enough to be used on meshes with different numbers of vertices. Weconduct several experiments including shape segmentation and classification,and our LaplacianNet outperforms stateoftheart algorithms for these tasks onShapeNet and COSEG datasets.
Quick Read (beta)
LaplacianNet: Learning on 3D Meshes with Laplacian Encoding and Pooling
Abstract
3D models are commonly used in computer vision and graphics. With the wider availability of mesh data, an efficient and intrinsic deep learning approach to processing 3D meshes is in great need. Unlike images, 3D meshes have irregular connectivity, requiring careful design to capture relations in the data. To utilize the topology information while staying robust under different triangulation, we propose to encode mesh connectivity using Laplacian spectral analysis, along with Mesh Pooling Blocks (MPBs) that can split the surface domain into local pooling patches and aggregate global information among them. We build a mesh hierarchy from fine to coarse using Laplacian spectral clustering, which is flexible under isometric transformation. Inside the MPBs there are pooling layers to collect local information and multilayer perceptrons to compute vertex features with increasing complexity. To obtain the relationships among different clusters, we introduce a Correlation Net to compute a correlation matrix, which can aggregate the features globally by matrix multiplication with cluster features. Our network architecture is flexible enough to be used on meshes with different numbers of vertices. We conduct several experiments including shape segmentation and classification, and our LaplacianNet outperforms stateoftheart algorithms for these tasks on ShapeNet and COSEG datasets.
1 Introduction
Analyzing highquality 3D models is of great significance in computer vision and graphics. Better understanding of 3D shapes would benefit many tasks such as segmentation, classification and shape analysis. Recently, deep learning methods have been prevalent in 2D image processing tasks such as image classification [1, 2] and semantic segmentation [3, 4]. With the help of largescale image datasets and improved computational resources, deep learning methods boost performance of image processing algorithms by a large margin. Inspired by the success in images, researchers also apply learning algorithms to 3D data. Recently, largescale 3D datasets have made it possible to train neural networks for 3D shapes. Nevertheless, it is not a simple extension to apply neural networks in the 3D space. There are various 3D representations. The majority of 3D representations such as meshes, point clouds etc. are noncanonical, requiring special design to put them through neural networks. To address this, some approaches are trained on ModelNet [5] and deal with voxels, but the resolution of voxel data is limited due to the curse of dimensionality. Alternatively, point clouds representing an object by a set of unstructured points with their $xyz$ coordinates are commonly used. However, point clouds do not carry connectivity information and therefore are less efficient than meshes to represent shapes, and may have ambiguities when two surfaces are close. Another representation, the 3D mesh, is a fundamental data structure in computer graphics and vision, which not only encodes geometry but also topology and therefore has better descriptive power than the point cloud. A mesh is a graph with vertices, edges and faces that characterize the surface of a shape. For deep learning methods, mesh data is more compact but irregular when compared to voxels, making simple operations in the image domain such as convolutional kernels highly nontrivial. It also contains richer structure than a point cloud, which can be exploited when learning on 3D meshes. This paper aims to propose a flexible network structure that can perceive connectivity information while staying robust under different triangulation. To learn on 3D meshes, we propose the Laplacian Encoding and Pooling Network (LaplacianNet), which takes raw features of mesh models as input and outputs a function defined on the vertices. Inspired by image processing networks, we observe an intuitive principle that vertex features should be computed independently and associatively in different parts of the network. We therefore extend this ideology into nonEuclidean space of 2manifolds. In our design as in Figure 1, the basic network structure involves consecutive Mesh Pooling Blocks (MPBs). Each block can split the surface into patches, like superpixels in the image domain, by Laplacian spectral clustering. After splitting, the MPB can simultaneously compute features of individual vertices and clusters. Considering the relationships between clusters, we use a Correlation Net to compute a matrix that can fuse the information globally. Compared to images, a major difficulty for learning on meshes is that the vertices are unordered, and so are the clusters. For this reason, a fullyconnected layer cannot work out the correlation matrix effectively. Therefore, we disentangle the correlation by independently mapping the clusters into a vector space. Then, the correlation between a pair of clusters is determined by the inner product of their corresponding vectors. Before our LaplacianNet, effort has been made to learn on 3D meshes. Please refer to [6] for history and frontier research of deep learning on geometric data. Some works exploit geodesic distances on shapes [7, 8]. Based on this metric they design a spatial convolutional operator. Geodesic distance could be invariant under isometric transformation, making their networks more robust. Compared to their directly computing the distance, our network utilizes the geodesic distance in an implicit way. The clustering is performed in the Laplacian feature domain, where Euclidean distance can approximate geodesic distance on the manifold [9]. As a result, the pooling could stay robust under isometric deformation and different triangulation (shown in Figure 3). Others choose to work in the spectral domain [10, 11], by defining convolutional kernels in the Fourier domain. However, the dependency of coefficients on the domain basis makes it difficult to share weights across shapes. Consequently, works like [12] have modules purposefully designed to synchronize the basis of domains. Our network does not suffer from changing domains, and we also propose a flexible structure, Correlation Net, to address the alignment of clusters across models. Compared to all the aforementioned methods, we use both spectral and spatial information, such that our network can utilize the connectivity of meshes while staying robust under different domains with inconsistent triangulation. The pipeline of our approach is shown in Figure 1, which learns crossdomain mesh functions using both spatial and spectral information. Overall, LaplacianNet has the following three modules: 1) the preprocessing step computes vertex features and clusters from the raw mesh data; 2) the Mesh Pooling Block (MPB) calculates local features within local regional clusters and collects global information through Correlation Net; 3) the last part of the network depends on the specific application, e.g., it may output segmentation masks or classification labels. In the experiments, we first evaluate the importance of mesh pooling blocks and the choice of the input features. We then justify that our single network can deal well with different mesh models with different numbers of vertices. To test the capability of our network, we train our network on the ShapeNet and COSEG datasets to perform classification and segmentation tasks, which are fundamental shape understanding tasks in computer vision, and show superior overall performance. The main contributions of our method are as follows:

•
We propose the Laplacian Encoding and Pooling Network (LaplacianNet), a general network for learning on 3D meshes, which can utilize the connectivity of meshes while staying robust under different triangulation.

•
We propose a flexible pooling strategy that can split model surfaces into clusters, like superpixels in images. By varying the clusters from fine to coarse, the network can process meshes hierarchically.

•
We introduce a Correlation Net to compute the relationship among clusters. The computation process circumvents the randomness of cluster ordering, enabling consistency across domains.
2 Related Work
We first summarize deep learning methods on 3D representations, and then provide a brief introduction to alternative input shape features. Finally, we review recent methods for mesh segmentation, which is a fundamental task when analyzing shapes. 3D Deep Learning. With the increasing availability of 3D models, deep learning methods for analyzing 3D data structure have been widely studied nowadays. There are several representations for 3D shapes, including voxels, point clouds and meshes. The voxel representation is similar to pixels in the 3D space, which can utilize a direct extension of 2D convolutional networks [13]. The point cloud is intensively researched, with works like [14] learning the transformation matrix of points and obtaining decent results on multiple datasets. Qi et al. [15] further propose PointNet++ that adds pooling operations, where pooling areas are selected by nearest neighbors. We aim at different problems: While they address problems on point clouds, our method focuses on meshes. To better understand 3D meshes, the topology information is deeply exploited in our method. First, we use topology to perform clustering. In contrast, the nearest neighbor pooling strategy of PointNet++ cannot perceive surface structure of meshes. Second, we use features that contain topology information as input. Meanwhile, we believe that it is not appropriate to carry the entire topology during the whole process. The connectivity information is irregular, large, hard to compute, and sensitive to noise. We instead condense the topology information by encoding the connectivity into the pooling areas and input features. For the mesh representation, spatial methods [7, 8] define convolutional kernels on surfaces. Our LaplacianNet also utilizes spatial information, as the pooling strategy partitions the surface into patches, acting like superpixels in image processing. For spectral methods, Bruna et al. [10] introduce a spectral convolutional layer on graphs, which can be interpreted as convolutions in the Fourier domain. Henaff et al. [16] handle largescale classification problems. They propose to exploit the underlying relationships among individual data points by constructing a graph of the dataset and then solving a graphbased classification. When processing the graph, they introduce a CNN structure with spectral pooling. Compared with ours, their work is interested in each node and the pooling cluster does not contain geometric information. A fundamental problem of spectral convolution is generalizing across domains since the coefficients of spectral filters are basis dependent [6]. To address this problem, Yi et al. [12] further propose SyncSpecCNN to perform convolutional operations in a synchronized spectral domain, where they train a Spectral Transformer Network to align the functions in different domains. Compared to all the spectral CNN methods above, our LaplacianNet takes advantage of Laplacian spectral analysis to encode mesh topology and identify a spatial pooling strategy but avoids suffering from its dependency on the domain. We also show in the Experiments section that our method is robust under different object categories, number of vertices, and triangulation. Shape Features. Over the years, many shape features have been developed to describe shape characteristics, including curvatures, geodesic shape contexts, geodesic distance features as used in [17], Heat Kernel Signatures [18], Wave Kernel Signatures [19], etc. Our LaplacianNet exploits mesh Laplacian spectral analysis, which provides an effective encoding of mesh topology. Laplacian eigenvectors also help us to cluster vertices for pooling layers, and it is an intrinsic feature describing the geometry of shapes. Readers can refer to [20] for details about computation and applications of graph Laplacian. However, two essential problems with Laplacian eigenvectors are that the sign is not welldefined, and perturbation occasionally occurs in high frequency terms. To eliminate these ambiguities, we use the absolute value of low frequency terms as input. Mesh Segmentation. Mesh segmentation has long been a fundamental task in the field of computer vision and graphics. There are unsupervised and supervised methods to perform this task. For unsupervised methods, recent work usually uses the correspondence or correlation between shape units to cosegment a collection of objects in the same category [21, 22, 23, 24]. Those methods essentially analyze a whole dataset of similar 3D shapes and cluster shape parts that can be consistently segmented into one class. Other works try to take advantage of labeled data to develop a supervised method. Thanks to recent shape segmentation datasets [25, 26, 27, 28], supervised methods obtain higher accuracy than unsupervised ones. Among all those datasets, COSEG [26] and ShapeNet [26] have sufficiently many samples to train a network with reasonable generalizability, so we conduct experiments on the two datasets. Previous deep learning methods usually design different architectures to do segmentation. For example, George et al. [29] design a multibranch 1D convolutional network and Wang et al. [30] put convolutional kernels on neighboring points and a pooling layer on the coarsened mesh. A 2D CNN is embedded into a 3D surface network by a projection layer in [31]. Guo et al. [32] concatenate different feature vectors into a rectangular block and apply CNNs to this imagelike domain. Our method aims to develop a general network for learning on 3D meshes, and we demonstrate that our general approach outperforms existing methods in most cases.
3 Methodology
3.1 Problem Statement
Our proposed network is a general method for 3D meshes, capable of dealing with different numbers of vertices. Suppose that the current input $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is a mesh with $N$ vertices. There is an input feature function $f$ defined on vertices $\mathcal{V}$, i.e., $f:\mathcal{V}\to {\mathbb{R}}^{N\times c}$ where $c$ is the dimension of input features, typically containing coordinates, normals, mesh Laplacian, curvatures, etc. At the same time, there is a target function $g$ we aim to produce. It can usually be a vector function such as a segmentation $g:\mathcal{V}\to C_{s}{}^{N}$ where each entry corresponds to the label of a vertex, or a single value like classification category for the whole object $g:\mathcal{G}\to {C}_{l}$, where ${C}_{s}$ and ${C}_{l}$ are the sets of segmentation labels and classification categories respectively. We would like to mention that $g$ may also represent other functions including texture or normal. Our aim is to design a general neural network that learns the mapping from input feature $f$ to the output $g$. For the segmentation and classification tasks, we precompute the normal and mesh Laplacian eigenvectors as input features. Our network will output an $N\times {C}_{s}$ matrix for segmentation, which gives the score for each vertex belonging to each segmentation label, or a ${C}_{l}$ dimensional softmax vector for classification.
3.2 Feature Precomputation
We aim to use minimal features to characterize local geometric information. For this purpose, we precompute the vertex normals and mesh Laplacian eigenvectors, which along with vertex positions are used as the input to our network. The mesh Laplacian provides an effective way to encode mesh connectivity. For later pooling layers, we use Laplacian spectral clustering [20] at multiple resolutions. Different from spectral convolutions in [16], our pooling layers reduce any number of vertices to a desired dimension, which makes it possible for our method to cope with meshes with different topology. In practice, the Laplacian matrix is computed by
$$\bm{L}={\bm{A}}^{1}(\bm{D}\bm{W})$$  (1) 
where $\bm{A}=diag({a}_{1},\mathrm{\dots},{a}_{n})$ are vertex weights defined as local Voronoi areas ${a}_{i}$, equal to one third of the sum of onering triangles areas. $\bm{W}={\{{w}_{ij}\}}_{i,j=1,\mathrm{\dots},N}$ is the sparse cotangent weight matrix, discretization of continuous Laplacian over mesh surfaces [33], and $\bm{D}$ is the degree matrix which is a diagonal matrix with diagonal entries ${d}_{ii}={\sum}_{j=1}^{N}{w}_{ij}$. The Laplacian feature $\mathbf{\Phi}$ is the eigenvectors of matrix $\bm{L}$. Please refer to [20] for detailed computation and applications of graph Laplacian. To cluster vertices at different levels, we perform kmeans clustering on $\mathbf{\Phi}$ with different numbers of clusters $k={p}_{l}$ such that vertices are clustered into ${p}_{l}$ clusters for the ${l}^{\mathrm{th}}$ pooling block. $l=1,2,\mathrm{\dots},L$, and $L$ is the number of levels. To achieve localglobal feature extraction, ${{p}}_{{l}}$ decreases as ${l}$ increases. Note that since the clustering is in the feature domain, vertices on the surface within one cluster are not necessarily connected. This is also reasonable because some semantically similar vertices can be far away.
3.3 Network Architecture
The architecture of our LaplacianNet is illustrated in Figure 1. Given the preprocessed feature function $f$ defined on vertices $\mathcal{V}$, our LaplacianNet takes the feature matrix as input, each row being a feature vector of a vertex. By reducing the input features to a matrix, we avoid the complex graph structure and make it tractable for neural networks. Several mesh pooling blocks (MPBs) are then applied in various resolutions for multiple times. At the end of MPB blocks, the application network outputs the target function $g$. Details about pooling blocks will be further discussed in the next section. The architecture of the application network for classification and segmentation is shown in Figure 2. In our design, to circumvent the complex and irregular topology of mesh data, we seek a pipeline that can concisely describe the relationship among vertices. Instead of directly processing edges $\mathcal{E}$, we simplify this problem by only processing vertices $\mathcal{V}$ of mesh $\mathcal{G}$. Nevertheless, edges are not ignored, but instead implicitly encoded into Laplacian eigenvectors and spectral clusters as described in the previous subsection. Since the mesh Laplacian is intrinsically induced from geodesic distances, our method is robust to remeshing and isometric transformations. Moreover, since the total number of vertices in a mesh can vary significantly from model to model, an ideal network architecture should be able to deal with meshes with different numbers of vertices. Our solution is to design mesh pooling blocks, which turn meshes of arbitrary sizes into levels with the same number of clusters. By stacking together several blocks in a multiresolution manner, our network can learn to extract useful features from the mesh. In addition, our network uses parameters more effectively with shared weight MultiLayer Perceptron (MLP), which also helps avoid overfitting by reducing the complexity of our network. Our method consequently achieves good results for shape classification and segmentation.
3.4 Mesh Pooling Blocks
A mesh pooling block is composed of three modules: the MultiLayer Perceptron (MLP) layers [34], the pooling layers, and a Correlation Net. Figure 1 shows an illustration of a mesh pooling block. Each mesh pooling block obtains its input feature ${{\bm{F}}}_{{l}}$ of size ${N}{\times}{{c}}_{{l}}$ from the previous layer. It also gets the cluster mask ${{\bm{M}}}_{{l}}$ from the precomputation step, where an entry ${{m}}_{{l}{,}{i}}{\in}{\{}{1}{,}{2}{,}{\mathrm{\dots}}{,}{{p}}_{{l}}{\}}$ in the mask ${{\bm{M}}}_{{l}}{\in}{{\mathbb{Z}}}^{{N}}$ indicates for node ${i}$ which cluster it belongs to. The total number of clusters for the ${{l}}^{{\mathrm{th}}}$ block is ${{p}}_{{l}}$. As illustrated in Figure 1, the data flows through three branches. The upper path is a series of MLP layers, learning vertex features of increasing complexity; the middle path is the pooling layer followed by a correlation matrix multiplication, which fuses global and local information; the bottom path is a Correlation Net that computes the correlation matrix, learning the interaction among clusters. The upper path of the MPB in Figure 1 is a set of MLP layers with shared weight perceptrons connected to all the vertices. For a certain vertex ${i}$, an MLP layer multiplies its input ${{\bm{F}}}_{{l}{,}{i}}$ and weight matrix ${\bm{W}}$ with bias ${\bm{b}}$, followed by a ${R}{}{e}{}{L}{}{U}{}{(}{\cdot}{)}$ activation function. The operation of this layer can be expressed as
$$MLP({\bm{F}}_{l,i})=ReLU(\bm{W}{\bm{F}}_{l,i}+\bm{b})$$  (2) 
For the pooling layer, the input includes features ${\bm{F}}_{l}$ from the last block and a cluster mask ${\bm{M}}_{l}$. Its result is defined as applying the operation to all the nodes belonging to the same cluster. Take max pooling as an example. The pooling result ${\bm{P}}_{l,j}$ corresponding to the ${j}^{\mathrm{th}}$ cluster of the ${l}^{\mathrm{th}}$ block is:
$${\bm{P}}_{l,j}=\underset{{m}_{l,i}=j}{\mathrm{max}}{\bm{F}}_{i}$$  (3) 
Furthermore, we want the features to be computed across clusters. For images, convolutional kernels can be used to fuse pooling results. However, since triangle meshes do not have a regular grid and consistent clustering, such an approach does not work. Standard global pooling can be a simple choice, but each cluster has equal contribution and detailed information is lost. At the bottom path in Figure 1, to aggregate information from all clusters, we multiply the pooling results with a correlation matrix ${{\bm{C}}}_{{l}}{=}{{\{}{{c}}_{{i}{}{j}}{\}}}_{{{p}}_{{l}}{\times}{{p}}_{{l}}}$. Each entry ${{c}}_{{i}{}{j}}$ measures the correlation between the ${{i}}^{{\mathrm{th}}}$ and ${{j}}^{{\mathrm{th}}}$ clusters, such that the aggregated pooling result is obtained as $\stackrel{{~}}{{{\bm{P}}}_{{\bm{l}}}}{=}{{\bm{C}}}_{{\bm{l}}}{}{{\bm{P}}}_{{\bm{l}}}$. The Correlation Net computes the matrix ${\bm{C}}$ by learning latent vector embedding for each cluster:
$${\mathbf{\Psi}}_{l}=\{{\psi}_{lj}\}=Pooling(MLP({\bm{F}}_{l})),$$  (4) 
and entries of the correlation matrix are inner products of latent vectors of cluster pairs $$.
$${\stackrel{~}{\bm{P}}}_{l}={\mathbf{\Psi}}_{l}{\mathbf{\Psi}}_{l}^{\bm{T}}{\bm{P}}_{l}$$  (5) 
The concatenation layer combines vertex features from the upperpath MLPs and aggregated pooling results. For the vertex $i$ in cluster $j$, its output feature is written as
$$\{MLP({\bm{F}}_{l,i}),{\stackrel{~}{\bm{P}}}_{l,j}\}$$  (6) 
In summary, the MPB is used for both processing the local information and finding the relationships among local patches. The motivation for pooling in the mesh is to better understand spatial information. Ideally, the input vertices are hierarchically clustered into different areas, and the relationships between areas need to be considered. In practice, we process the hierarchical information by clustering vertices into different numbers of clusters. The Correlation Net learns to compute a correlation matrix to describe the relative relationships among spatial areas after pooling. Using Laplacian clustering ensures that the clustering is robust under different triangulation. Also, the Euclidean distance in the Laplacian feature domain approximates the geodesic distance [9]. Therefore, such a clustering can find meaningful patches in the spatial domain. Moreover, the clustered areas give a reasonable partition, as shown in the camel example in Figure 1 where vertices with the same color roughly belong to one semantic area.
4 Experiments
4.1 Implementing Details
We implement our method using Tensorflow. We train and test the network on a desktop with an NVIDIA GTX1080 GPU. The optimization is achieved using the Adam [35] solver with learning rate $7\times {10}^{4}$ and momentum $0.9$. The network is trained for 200 epochs with batch size 8. The input features have 22 dimensions, including 6 dimensions of positions and normals, and the other 16 dimensions are the absolute values of the Laplacian eigenvectors corresponding to the 16 lowest frequencies. According to the experimental results in the next section, by default we choose to use two Mesh Pooling Blocks with 16 and 8 clusters respectively. There are two MLP layers in a mesh pooling block outputting 128 and 256 channels. Following MPBs is a specific application network based on the task to be performed. In total, the network’s depth is ${11}$ for the segmentation and classification tasks.
4.2 Network Evaluation
We now evaluate different parts of our network. This series of evaluation is conducted on the COSEG dataset, with results shown in Tab. I. First we test the usefulness of the input features. The performance of a certain algorithm can be affected by the features used. In the experiments, we use coordinates, normals and Laplacian eigenvectors as vertex features. In this part, we test the network without one of those three features. We can see that a combination of all the features achieves the best results. Second, we test the setting of Mesh Pooling Blocks. By default, our method uses 2 MPBs. We compare this with alternative numbers of MPBs ranging from 0 to 3. The results show that 2 MPBs (our default method) gives the best performance. We also test the usefulness of our Correlation Net. OursnoCorrNet is the network without the Correlation Net and matrix ${\bm{C}}$. We observe that the the aggregation of global information is important for our network. Third, another difficulty when handling mesh data is varying mesh connectivity. As mentioned before, our method is robust under different mesh triangulation as a result of the pooling strategy. The Laplacian eigenfunction is induced from geodesic distances and therefore invariant under isometric transformation, so the pooling areas as well as input features can essentially stay unchanged when we remesh the object. We visualize the connectivity of the objects before and after remeshing in Figure 3, which is obtained by subdivision followed by mesh decimation [36] to generate irregular connectivity. Moreover, the quantitative results (Oursremesh) in Tab. I show that our network is robust under different triangulation. Fourth, our network does not rely on the same vertex numbers for models. An experiment is performed to test this. In this experiment, we first simplify COSEG models to 1500, 2000 and 2500 vertices. Then we split the training and test set in the same strategy for all three resolutions. We train our network on a mix of two of the datasets and test models from the third (Ours1500, Ours2000 and Ours2500 in the table). Accuracies in Tab. I show that our network works well with varying vertex numbers, compared to the last row where LaplacianNet is trained with models all containing 2048 vertices.
4.3 Part Segmentation
In this section, we use MPBs to conduct part segmentation on the ShapeNet [26, 37] and COSEG [27] datasets. The application network for segmentation is illustrated in Figure 2(top row). Its input is the category label and features from the previous MPB. The output is a softmax score for each category. The application network has two perceptron layers, a maxpooling layer and two fully connected layers. We minimize the crossentropy loss between the onehot vector of ground truth and the network output. ShapeNet is a large repository of shapes from multiple categories. To leverage this dataset to perform segmentation, Yi et al. [26] develop an active learning method for efficient annotation. However, their annotations are not directly on the mesh vertices, but on the point cloud resampled from the shapes. To recover the graph structure of manifold surfaces for computation of mesh Laplacian and segmentation, we apply [38] on the original ShapeNet models. After that, we transfer the annotations on the point clouds to the nearest mesh vertices. Two metrics are used in previous segmentation results on ShapeNet, namely accuracy and IoU (Intersectionoverunion). We compute both of them to compare with stateoftheart deep learning methods on 3D shapes [12, 31], and some other methods performing segmentation on ShapeNet based on point clouds such as [39]. Tab. II shows that our method achieves the highest average accuracy, and outperforms stateoftheart methods on 13 out of 16 categories. In terms of IoU, our performance is comparable to the stateoftheart, achieving the best performance in 8 categories. Some segmentation results are presented in Figure 4.
Chair  Vase  Telealien  
Xie et al. [40]  87.1%  85.9%  83.2% 
Wang et al. [30]  95.9%  91.2%  93.0% 
OursnoMPB  76.6%  77.8%  80.7% 
Ours1MPB  85.3%  86.1%  88.9% 
Ours3MPBs  90.1%  92.2%  91.6% 
OursnoCorrNet  86.9%  85.6%  90.7% 
Ours1500  90.3%  90.0%  89.0% 
Ours2000  90.9%  91.6%  89.3% 
Ours2500  87.0%  86.6%  88.5% 
Oursremesh  92.1%  91.5%  91.8% 
OursnoCoordinates  90.6%  88.6%  84.2% 
OursnoNormal  87.6%  86.1%  85.0% 
OursnoLaplacian  79.6%  87.1%  86.1% 
Ours  94.2%  92.2%  93.9% 
The COSEG [27] dataset is also a commonly used benchmark for shape segmentation. Compared to ShapeNet, COSEG is much smaller. It has 8 scanned categories and 3 synthetic categories. Each of the 8 categories has around 20 models, which are too few for deep learning. The 3 synthetic categories each have 900 models, so we test our algorithm with the synthetic categories. We compare our result with [40] and [30] in Tab. I. Our approach outperforms both of them.
Method  input  MN10  MN40  ShapeNet 

PointNet  point    89.2%   
PointNet++  point    91.9%   
SONet  point  95.7%  93.4%   
3DShapeNets  volume  83.5%  77.0%   
Voxnet  volume  91.0%  84.5%   
ACNN  mesh      93.99% 
SyncSpecCNN  mesh      99.71% 
SPH  mesh    68.2%   
Ours  mesh  97.4%  94.21%  99.88% 
4.4 Mesh Classification
In this section, we evaluate the accuracy of object categorization. The input features to this task are the same as the segmentation task. The applicationspecific network following Mesh Pooling Blocks is shown in Figure 2(bottom row). It contains two perceptron layers, a maxpooling layer and two fully connected layers. In this case the output is a softmax score for each category. We minimize the crossentropy loss between onehot vector of ground truth and network output. For this classification task, we use ModelNet10, ModelNet40 [5] and ShapeNet. In each experiment, the network is trained for 200 epochs. Our method outperforms all stateoftheart single model shape classification approaches in all the datasets.
4.5 Failure Cases
We would like to restate that our method does not work directly on the original ShapeNet models for two reasons: 1) the annotations are labeled on point clouds uniformly sampled from shapes, instead of vertices of the mesh; 2) meshes in ShapeNet are not manifold meshes, preventing us from performing highquality spectral clustering. Therefore we convert shapes into watertight manifold surfaces using [38], simplify the mesh to a reasonable level, and transfer part labels to vertices from the point cloud through nearest neighbor matching. Error accumulates during this process. In Figure 5 we show some failure cases when performing part segmentation in ShapeNet. There are several typical problems for models that are poorly segmented. In the chair example, we can observe sharp edges in ground truth labels, and such artifacts could be caused by the noise in the original point cloud. This kind of noise makes learning reasonable segmentation more challenging. As in the bag case, we can observe in the red circle that the ground truth is incorrect. This may be caused by changes of the shape while using [38] to make the model watertight, so that the nearest neighbor algorithm gets the wrong correspondence when transferring labels. We can see that our algorithm actually gets a more correct answer than the “ground truth”. Failure in the motorbike case is caused by lack of training data. In the annotations, the motorbike class has the fewest data samples. Worse still, some of the training examples fail to be converted when performing [38], because motorbikes have complex topological structure, making transfer challenging. In practice, it can be a big problem that errors in the transfer and simplification will decrease the amount of training data. Last but not least, the skateboard shows that there might be some holes, or other artifacts, in the surface, that could affect graph structure and mislead our algorithm.
5 Conclusion
In this paper, we present a deep learning approach to predicting functions defined on shapes. The key idea is to perform multiscale pooling based on Laplacian spectral clustering, and use a Correlation Net following pooling to fuse global information. Compared to the pooling operation in the previous literature, our network does not require a uniform number of vertices in each model. Our work outperforms stateoftheart methods in most categories for shape classification and segmentation. Several future extensions can be explored. First our LaplacianNet may be applied to other tasks. For example, 3D reconstruction is fundamental but challenging. A capable and general method to generate meshes is of great demand. Our network has the potential to achieve this task, because it can neatly encode connectivity in vertices, and intrinsically understands the topology through spectral clustering and spatial pooling. Furthermore, as a general structure for mesh processing, our network may also be applied to shape deformation, completion and correspondence. Another area of future work is to design a kernel with trainable parameters, like convolutional kernels in images. For instance, a Gaussian kernel can be calculated inside a pooling cluster. Meanwhile, the transpose convolution plays an important role in traditional CNN frameworks. The combination of forward and transpose convolutions lets the network freely downsample and upsample data. Finally it would be interesting to see how our network can work on general graphs. In this paper, we mainly deal with manifold meshes, using geodesic distances to construct Laplacian. However in other problem settings, we can use any distance metric that can best describe the problem. For example, we might experiment with our method on social networks with arbitrary size and connectivity.
Acknowledgments
This work was supported by National Natural Science Foundation of China (No. 61872440 and No. 61828204), Beijing Natural Science Foundation (No. L182016), Young Elite Scientists Sponsorship Program by CAST (No. 2017QNRC001), Youth Innovation Promotion Association CAS, Huawei HIRP Open Fund (No. HO2018085141), CCFTencent Open Fund, SenseTime Research Fund and Open Project Program of the National Laboratory of Pattern Recognition (NLPR).
References
 [1] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. FeiFei, “Largescale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
 [2] M. Chen, G. Ding, S. Zhao, H. Chen, Q. Liu, and J. Han, “Reference based LSTM for image captioning.” in AAAI, 2017, pp. 3981–3987.
 [3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
 [4] S. Kwak, S. Hong, B. Han et al., “Weakly supervised semantic segmentation using superpixel pooling network.” in AAAI, 2017, pp. 4111–4117.
 [5] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1912–1920.
 [6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond Euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
 [7] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst, “Geodesic convolutional neural networks on Riemannian manifolds,” in Proceedings of the IEEE international conference on computer vision workshops, 2015, pp. 37–45.
 [8] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein, “Learning shape correspondence with anisotropic convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 3189–3197.
 [9] K. Crane, C. Weischedel, and M. Wardetzky, “The heat method for distance computation,” Commun. ACM, vol. 60, no. 11, pp. 90–99, Oct. 2017. [Online]. Available: http://doi.acm.org/10.1145/3131280
 [10] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
 [11] R. Li, S. Wang, F. Zhu, and J. Huang, “Adaptive graph convolutional neural networks,” arXiv preprint arXiv:1801.03226, 2018.
 [12] L. Yi, H. Su, X. Guo, and L. J. Guibas, “SyncSpecCNN: Synchronized spectral CNN for 3d shape segmentation.” in CVPR, 2017, pp. 6584–6592.
 [13] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multiview CNNs for object classification on 3d data,” in CVPR, 2016, pp. 5648–5656.
 [14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3d classification and segmentation,” Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, vol. 1, no. 2, p. 4, 2017.
 [15] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems, 2017, pp. 5099–5108.
 [16] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graphstructured data,” arXiv preprint arXiv:1506.05163, 2015.
 [17] E. Kalogerakis, A. Hertzmann, and K. Singh, “Learning 3d mesh segmentation and labeling,” ACM Transactions on Graphics (TOG), vol. 29, no. 4, p. 102, 2010.
 [18] J. Sun, M. Ovsjanikov, and L. Guibas, “A concise and provably informative multiscale signature based on heat diffusion,” in Computer graphics forum, vol. 28, no. 5. Wiley Online Library, 2009, pp. 1383–1392.
 [19] M. Aubry, U. Schlickewei, and D. Cremers, “The wave kernel signature: A quantum mechanical approach to shape analysis,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 1626–1633.
 [20] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.
 [21] Z. Wu, Y. Wang, R. Shou, B. Chen, and X. Liu, “Unsupervised cosegmentation of 3d shapes via affinity aggregation spectral clustering,” Computers & Graphics, vol. 37, no. 6, pp. 628–637, 2013.
 [22] Z. Shu, C. Qi, S. Xin, C. Hu, L. Wang, Y. Zhang, and L. Liu, “Unsupervised 3d shape segmentation and cosegmentation via deep learning,” Computer Aided Geometric Design, vol. 43, pp. 39–52, 2016.
 [23] Q. Huang, V. Koltun, and L. Guibas, “Joint shape segmentation with linear programming,” ACM Transactions on Graphics (TOG), vol. 30, no. 6, p. 125, 2011.
 [24] O. Sidi, O. van Kaick, Y. Kleiman, H. Zhang, and D. CohenOr, Unsupervised cosegmentation of a set of shapes via descriptorspace spectral clustering. ACM, 2011, vol. 30, no. 6.
 [25] X. Chen, A. Golovinskiy, and T. Funkhouser, “A benchmark for 3d mesh segmentation,” ACM Transactions on Graphics (TOG), vol. 28, no. 3, p. 73, 2009.
 [26] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas et al., “A scalable active framework for region annotation in 3d shape collections,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 210, 2016.
 [27] Y. Wang, S. Asafi, O. Van Kaick, H. Zhang, D. CohenOr, and B. Chen, “Active coanalysis of a set of shapes,” ACM Transactions on Graphics (TOG), vol. 31, no. 6, p. 165, 2012.
 [28] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, “The Princeton shape benchmark,” in Shape modeling applications, 2004. Proceedings. IEEE, 2004, pp. 167–178.
 [29] D. George, X. Xie, and G. K. Tam, “3d mesh segmentation via multibranch 1d convolutional neural networks,” Graphical Models, vol. 96, pp. 1–10, 2018.
 [30] P. Wang, Y. Gan, P. Shui, F. Yu, Y. Zhang, S. Chen, and Z. Sun, “3d shape segmentation via shape fully convolutional networks,” Computers & Graphics, vol. 70, pp. 128–139, 2018.
 [31] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri, “3d shape segmentation with projective convolutional networks,” in Proc. CVPR, vol. 1, no. 2, 2017, p. 8.
 [32] K. Guo, D. Zou, and X. Chen, “3d mesh labeling via deep convolutional neural networks,” ACM Transactions on Graphics (TOG), vol. 35, no. 1, p. 3, 2015.
 [33] M. Meyer, M. Desbrun, P. Schröder, and A. H. Barr, “Discrete differentialgeometry operators for triangulated 2manifolds,” in Visualization and Mathematics III, 2003, pp. 35–57.
 [34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.
 [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [36] M. Garland and P. S. Heckbert, “Surface simplification using quadric error metrics,” in Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’97. New York, NY, USA: ACM Press/AddisonWesley Publishing Co., 1997, pp. 209–216. [Online]. Available: https://doi.org/10.1145/258734.258849
 [37] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “ShapeNet: An informationrich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
 [38] J. Huang, H. Su, and L. Guibas, “Robust watertight manifold surface generation method for ShapeNet models,” arXiv preprint arXiv:1802.01698, 2018.
 [39] N. Verma, E. Boyer, and J. Verbeek, “FeaStNet: Featuresteered graph convolutions for 3d shape analysis,” in CVPR 2018IEEE Conference on Computer Vision & Pattern Recognition, 2018.
 [40] Z. Xie, K. Xu, L. Liu, and Y. Xiong, “3d shape segmentation and labeling via extreme learning machine,” in Computer graphics forum, vol. 33, no. 5. Wiley Online Library, 2014, pp. 85–95.
 [41] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz, “Rotation invariant spherical harmonic representation of 3d shape descriptors,” in Symposium on geometry processing, vol. 6, 2003, pp. 156–164.
 [42] J. Li, B. M. Chen, and G. H. Lee, “SONet: Selforganizing network for point cloud analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9397–9406.
 [43] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for realtime object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 922–928.