Geometric Feedback Network for Point Cloud Classification

  • 2019-12-02 02:54:28
  • Shi Qiu, Saeed Anwar, Nick Barnes

Abstract

As the basic task of point cloud learning, classification is fundamental but always challenging. To address some unsolved problems of existing methods, we propose a network designed as a feedback mechanism, a procedure that modifies the output in response to the output itself, to comprehensively capture the local features of 3D point clouds. In addition, we enrich the explicit and implicit geometric information of point clouds in low-level 3D space and high-level feature space, respectively. By applying an attention module based on channel affinity, which focuses on distinct channels, the learned feature map of our network can effectively avoid redundancy. The performance on synthetic and real-world datasets demonstrates the superiority and applicability of our network. Compared with other state-of-the-art methods, our approach balances accuracy and efficiency.

 


Shi Qiu1,2, Saeed Anwar1,2 and Nick Barnes1
1Australian National University   2Data61 - CSIRO
{shi.qiu, saeed.anwar}@data61.csiro.au, [email protected]

1 Introduction

Point clouds are one of the fundamental representations of 3D data and are widely used in both academia and industry thanks to the development of 3D sensing technology and related applications. Generally, 3D point clouds can be collected by 3D scanners [4] using physical touch or non-contact measurements, e.g. light, sound, or LiDAR. In particular, LiDAR scanners [9] are in service in many areas, including agriculture, biology, and robotics. Because of these contributions, point cloud analysis attracts much interest for further investigation.

Previously, traditional algorithms [30, 22, 29, 38] that operate on 3D data incorporated estimated geometric information and reconstructed models. With the help of deep learning, recent works on 3D data focus on data-driven approaches via Convolutional Neural Networks (CNNs). Classical works can be generally categorized as: multi-view images with 2D CNNs (e.g. MVCNN [32]), volumetric/mesh data with 3D CNNs (e.g. VoxNet [20]), 3D point cloud with multi-layer perceptrons (MLP) (e.g. [26]), etc.

Currently, many works are exploring different methods to improve the processing of 3D point clouds, but there remain some unsolved issues: (1) How can we force the network to automatically learn a better representation from the abstract high-level embedding space? (2) How do we refine the output features in order to focus on the crucial information? (3) Besides the regular features, can we learn from more geometric clues for a comprehensive analysis?

Figure 1: Enriched geometric features. In low-level space, we explicitly estimate geometric terms, e.g. edges (green vectors) and normals (red vectors). In high-level space, we aggregate neighbors (green dots) to capture both prominent features (red dots) and fine-grained features (purple dots).

To investigate possible answers to the above concerns, we present a novel attentional feedback structure for point cloud learning that incorporates geometric context. As supported by substantial biological evidence, feedback mechanisms are very helpful for visual tasks, because feedback paths in our brain, especially in the visual cortex, consist of more neurons than the forward paths. Feedback mechanisms have also been successfully used to build stable and responsive systems in industry. Some early works applied this mechanism in CNNs to computer vision problems such as 2D visual segmentation [5] or 3D hand pose estimation [24]. For point clouds, however, PU-GAN [13] uses a feedback-like unit to some extent, but only for feature expansion inside its generator module. Our primary motivation for investigating feedback is to automatically refine the output feature map by measuring the difference between the input and the corresponding feedback signal. Using this approach, we enable the network to generate a better learning output.

Regarding high-level feature representation for 3D point clouds, another widely applied mechanism, attention, can help the network put more emphasis on useful information [36]. Attention modules are used in many 2D visual problems, e.g. image segmentation [42, 8, 7] and image denoising [1]. For 3D point cloud analysis, we design a Channel-wise Affinity Attention module that enhances the feature map based on the affinity between channels.

To deal with the unorderedness of regular point cloud data, PointNet [26] proposed using symmetric functions to aggregate features, and most subsequent works apply max-pooling to extract prominent features. Although prominent features are representative, they lack some details, especially in local areas, and hence are insufficient for precise classification. To address this problem, we propose a simple but effective way to learn fine-grained features via a shared fully connected (FC) layer on each local neighborhood. Although recent works [27, 18, 41, 47, 12] show that CNN approaches can benefit from more geometric information, such information can also hurt performance when redundant or useless features are incorporated. To maximize its advantage, we carefully form a low-level descriptor with explicit physical meaning to enrich the geometric information.

The contributions of our work can be summarized as:

  • We design a feedback CNN mechanism for local prominent feature learning on point clouds.

  • We introduce a Channel-wise Affinity Attention module to better refine high-level features of point clouds.

  • We propose an intuitive but effective structure to extract local fine-grained features as a complement. We also show that our estimated mesh descriptor can significantly improve the performance of the network.

  • We present experimental results showing that our proposed network outperforms state-of-the-art methods on both synthetic and real-world 3D point cloud classification benchmarks.

Figure 2: Our network architecture for point cloud classification. In the beginning, the Naive Mesh Descriptor (NMD) explicitly expands the low-level 3D coordinates to a 14-dimensional geometric feature. Then a series of modules comprehensively learn the representation of the point cloud in high-level embedding spaces. The Attentional Feedback Module for EdgeConv (AFME) aggregates the prominent features of a local area, while the Fine-grained Edge Feature Extractor (FEFE) provides more local details as a complement. Further, the Channel-wise Affinity Attention (CAA) module is applied to avoid redundant information and refine the feature map along channels. Finally, we concatenate the max pooling and average pooling results to form a global vector, from which the subsequent fully connected layers regress the scores for possible classes.

2 Related Work

Estimating geometric relations. Although scattered 3D point cloud data has many advantages, its main drawback is the lack of geometric information. To acquire more underlying knowledge of point clouds, conventional methods [23, 21, 38] tried to estimate the geometry of the point cloud (e.g. faces, normals, curvature) and also proposed many hand-crafted features for recognition and matching (e.g. shape context [11], point histograms [29]). Besides, recent CNN-based works [27, 18, 43, 46] achieve better performance thanks to the permutation invariance of low-level geometry. To serve the needs of modern methods, here we expand low-level geometric information from the given 3D coordinates, then integrate it as an explicit geometric descriptor for network processing.

Learning local features. PointNet [26] passes point cloud data through MLPs to obtain a high-level feature representation of every single point, and successfully handles the unordered nature of point cloud data with a symmetric function. Owing to its effectiveness, later works [27, 15, 43, 18] also adopt MLP-based operations for point cloud processing. Meanwhile, researchers realized that local features are promising because they contribute additional characteristics to global features. Although the points are unordered in point cloud data, we may group points based on various metrics. Generally, one approach selects seed points as centroids, and then applies a query algorithm (e.g. Ball Query in [27]) based on 3D Euclidean distance to group points into local clusters. After extracting local features, the network may further process the centroids.

Another track is to find each point’s neighbors in embedding space based on N-dimensional Euclidean distance [43] and then group each point’s neighbors in the form of high dimensional vectors. In contrast to the previous type, this method can avoid sparsity and update dynamically in different feature dimensions. In terms of feature aggregation, max-pooling [26] is widely employed since it can solve the issue of unorderedness and gather information sufficiently. In spite of the benefits, there are some weak points of the current max pooling approach: it may lose local details or involve bias. To overcome these problems, we add complementary local fine-grained features and feedback structures to reduce possible bias.

Attention mechanism for CNN. The idea of attention has been successfully used in many areas of Artificial Intelligence (AI). As with human beings, the computational resources of machines are limited, so we need to focus on important aspects. Previously, Vaswani et al. [36] proposed different types of attention mechanisms for neural machine translation. Subsequently, attention mechanisms were incorporated in visual tasks; for example, Wang et al. [42] extended the idea of self-attention to the spatial domain for computer vision problems. Also, SENet [8] credits winning the ImageNet [28] challenge to its channel-wise attention module. Other works [44, 7, 14] derive benefits from both the spatial and channel domains of 2D images.

In terms of 3D point clouds, attention modules contribute to point cloud detection [25], generation [33], segmentation [48, 16, 49], etc. However, limited work has been done on well-designed attention mechanisms targeting 3D point cloud classification. On this front, Xie et al. [46] utilized a spatial self-attention module for the shape context [11] of point clouds. Subsequent works [40, 6] also applied the Graph Attention [37] module to graph features constructed on point clouds. Unlike existing methods, we try to enhance the high-level representation of the point cloud by capturing the long-range dependencies along its channels.

3 Approach

As stated in Section 1, existing methods leave some problems unsolved. To tackle these challenges, we start from classical MLP mapping [26] and edge convolution [43]. We apply the basic MLP operation [26], i.e. a 1×1 convolution on the point cloud feature map, usually followed by a batch normalization layer and an activation function:

$$\mathcal{M}(\cdot) := \tau\big(\mathcal{BN}(c_{1\times 1}(\cdot))\big) \tag{1}$$

where $\mathcal{M}$ is the MLP, $\tau$ is the activation function, $\mathcal{BN}$ is batch normalization, and $c$ is a convolution whose subscript denotes the filter size. Moreover, the edge features $f_\psi$ defined in [43] are given by:

$$f_\psi(x_i) = (x_i,\; x_j - x_i), \quad x_i \in \mathbb{R}^d;\; x_j \in \mathrm{Neighbors}(x_i) \tag{2}$$

where $x_i$ is the centroid of a local area in $d$-dimensional feature space, and the entire feature map can be formed as $\mathcal{X}_{N\times d} = [x_1^T, x_2^T, \dots, x_N^T]$. The $x_j$'s are the neighbors found by the k-nearest neighbors (knn) algorithm.

The aim of EdgeConv, $\mathcal{E}$, is to construct radial local graphs (i.e. edge features $f_\psi$) consisting of edges that connect each centroid to its neighbors, which are then mapped by a shared MLP in feature space:

$$\mathcal{E}(\cdot) := \mathcal{M}(f_\psi(\cdot)) \tag{3}$$
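As a rough illustration of Equation 2, the sketch below builds the edge features of one centroid from its k nearest neighbors in plain Python. The helper names `knn_indices` and `edge_features` are hypothetical, and excluding the centroid from its own neighbor list is an assumption made here rather than something the text specifies.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (the point itself is
    excluded, which is an assumption of this sketch)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]

def edge_features(points, i, k):
    """One (x_i, x_j - x_i) vector per neighbor x_j, as in Equation 2."""
    xi = points[i]
    feats = []
    for j in knn_indices(points, i, k):
        diff = [b - a for a, b in zip(xi, points[j])]
        feats.append(list(xi) + diff)
    return feats

# Toy 2-D feature map with four points; k = 2 neighbors for the first point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
feats = edge_features(points, 0, 2)
```

Each centroid in a d-dimensional space therefore yields k vectors of length 2d, which is the input that the shared MLP of EdgeConv would map to the new embedding space.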

Unlike these approaches, we attempt to incorporate more geometric features at different levels. As illustrated in Figure 2, our network has a series of modules, each consisting of two main parts: one extracting prominent features and the other learning fine-grained features. Our mesh descriptor explicitly expands geometric information in low-level space to meet the demand of comprehensive learning. In this section, we introduce the critical modules in detail and mathematically formulate the operations.

3.1 Attentional Feedback Module for EdgeConv

The typical feedback mechanism aims to accurately control a process by monitoring its actual output and feeding the error signal back to force the system to generate the desired output. To be more specific, the output is passed through a feedback path as a feedback signal, and then the forward path can use such feedback signals to both adjust and control the system. By forming such a closed loop, the system can reduce the error, improve stability, and enhance robustness. Inspired by this, we propose the Attentional Feedback Module for EdgeConv (AFME) illustrated in Figure 3 for edge-based prominent features.

Figure 3: Attentional Feedback Module for EdgeConv (AFME). In general, we have three paths named the forward path (in green), the feedback path (in red), and the correction path (in blue). The error signal is defined as the difference between the feedback signal and the original input of AFME. After summing the outputs of the forward and correction paths, we apply max pooling to extract local maximums and a Channel-wise Affinity Attention (CAA) module to refine them further. (Dashed line: the edge features of the forward path are additionally used as the input of the corresponding Fine-grained Edge Feature Extractor (FEFE).)

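Forward path. Here we employ EdgeConv as the forward path of our AFME, denoted as $\Phi$, since this operation can explicitly capture both global shape structure and local neighborhood information. With Equations 2 and 3, where $\mathcal{M}$ is the shared MLP of Equation 1, the forward path can be formulated as:

$$\Phi(\phi_i, x_i) = \mathcal{M}(f_\psi(x_i)) = \mathcal{M}(x_i,\; x_j - x_i) \tag{4}$$

And the output of the forward path, with $d'$ the dimension of the new embedding space, is:

$$f_{\Phi_i} = \Phi(\phi_i, x_i); \quad f_{\Phi_i} \in \mathbb{R}^{d' \times k} \tag{5}$$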

Feedback and error signals. Following our forward path $\Phi$, an ideal output feature map should fully encode global and local details. Conversely, if the output $f_{\Phi_i}$ is indeed informative, we should be able to restore the original input $x_i$ from it. Suppose our feedback path ($\Upsilon$) aims to restore the input; then $\Upsilon$'s output, $x'_i$, can be termed the FeedbackSignal:

$$x'_i = \Upsilon(\upsilon_i, f_{\Phi_i}) = \Upsilon(\upsilon_i, \Phi(\phi_i, x_i)); \quad x'_i \in \mathbb{R}^d \tag{6}$$

Further, we take the difference between the restored and original inputs, $\Delta x_i$, to formulate the corresponding ErrorSignal:

$$\Delta x_i = x'_i - x_i = \Upsilon(\upsilon_i, \Phi(\phi_i, x_i)) - x_i; \quad \Delta x_i \in \mathbb{R}^d \tag{7}$$

Feedback path. Since we attempt to restore the input from $f_{\Phi_i} \in \mathbb{R}^{d' \times k}$, the feedback path needs to simulate a reverse process of the forward path (i.e. EdgeConv). In our proposal, we apply a shared Local Fully Connected (LFC) layer as the feedback path $\Upsilon$ of our module:

$$\mathcal{L}(\cdot) := \tau\big(\mathcal{BN}(c_{1\times k}(\cdot))\big) \tag{8}$$

where $\mathcal{L}$ is the LFC. Mathematically, LFC is a special case of a shared MLP with kernel size $[1, k]$, by which the $k$ neighbors in feature space can be fully connected. In contrast to EdgeConv, which expands a center point to local neighbors in the new embedding space, the shared LFC layer pulls the neighbors back to the previous space ($d' \to d$) and aggregates them at a center point ($k \to 1$) via learnable weights. In general, the feedback path follows:

$$\Upsilon(\upsilon_i, f_{\Phi_i}) = \mathcal{L}(f_{\Phi_i}) \tag{9}$$

Substituting Equation 9 into Equations 6 and 7, we can re-write the output of the feedback path, i.e. the FeedbackSignal, as:

$$x'_i = \mathcal{L}(\Phi(\phi_i, x_i)); \quad x'_i \in \mathbb{R}^d \tag{10}$$

and the ErrorSignal as:

$$\Delta x_i = \mathcal{L}(\Phi(\phi_i, x_i)) - x_i; \quad \Delta x_i \in \mathbb{R}^d \tag{11}$$

Correction path. Finally, the module passes the ErrorSignal through a correction path ($\Gamma$), which has the same structure as the forward path, i.e. an EdgeConv $\mathcal{E}$ built from a shared MLP $\mathcal{M}$:

$$\Gamma(\gamma_i, \Delta x_i) = \mathcal{E}(\Delta x_i) = \mathcal{M}(f_\psi(\Delta x_i)) \tag{12}$$

Therefore, the output features of the correction path can be formed as:

$$f_{\Gamma_i} = \Gamma(\gamma_i, \Delta x_i); \quad f_{\Gamma_i} \in \mathbb{R}^{d' \times k} \tag{13}$$

To close the feedback loop, we take the output of the correction path as the correction term for the original output of the forward path. After that, we apply max-pooling over the local area to obtain a compact feature map. Moreover, the Channel-wise Affinity Attention (CAA, see Section 3.2 for details) module further refines the final feature representation.

Finally, the operation of the Attentional Feedback Module for EdgeConv (AFME) can be summarized as:

$$f_i = \mathrm{AFME}(x_i) = \mathrm{CAA}\big(\max_{k}(f_{\Phi_i} + f_{\Gamma_i})\big); \quad f_i \in \mathbb{R}^{d'} \tag{14}$$
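The three paths of AFME can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the text does not state: batch normalization and the CAA refinement are omitted, ReLU stands in for the activation τ, the neighbor indices of the forward path are reused in the correction path, and the LFC weights are stored as one flattened row per output channel. All function and parameter names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def edge_conv(X, nbrs, W):
    """Map every edge feature (x_i, x_j - x_i) through one shared layer."""
    out = []
    for i, xi in enumerate(X):
        edges = [list(xi) + [b - a for a, b in zip(xi, X[j])] for j in nbrs[i]]
        out.append([relu(matvec(W, e)) for e in edges])
    return out  # N x k x d'

def lfc(F, U):
    """Shared local fully connected layer: fuse the k neighbor features of
    each point into a single d-dimensional vector."""
    return [relu(matvec(U, [a for row in Fi for a in row])) for Fi in F]

def afme_without_caa(X, nbrs, W_fwd, U_fb, W_cor):
    f_phi = edge_conv(X, nbrs, W_fwd)             # forward path
    x_restored = lfc(f_phi, U_fb)                 # feedback signal
    delta = [[r - x for r, x in zip(ri, xi)]      # error signal
             for ri, xi in zip(x_restored, X)]
    f_gamma = edge_conv(delta, nbrs, W_cor)       # correction path
    out = []
    for fp, fg in zip(f_phi, f_gamma):
        summed = [[a + b for a, b in zip(p, g)] for p, g in zip(fp, fg)]
        out.append([max(col) for col in zip(*summed)])  # max over k neighbors
    return out

# Toy example: two 1-D points that are each other's only neighbor (k = 1).
X = [[1.0], [3.0]]
nbrs = [[1], [0]]
features = afme_without_caa(X, nbrs, W_fwd=[[1.0, 1.0]],
                            U_fb=[[0.5]], W_cor=[[1.0, 0.0]])
```

In this toy run the forward path gives 3.0 and 1.0, the restored inputs are 1.5 and 0.5, and the resulting error signals 0.5 and -2.5 feed the correction path, whose output is added to the forward features before pooling.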

3.2 Channel-wise Affinity Attention Module

As mentioned in Section 2, most attention designs for point clouds operate in point-space, but their effects are not apparent. Instead, we prefer distributing attention weights along channels. Inspired by the spacetime non-local block [42], we can calculate the long-range dependencies without being affected by the unorderedness of point cloud data. However, the corresponding calculations also have a high computational cost. Therefore, we need an appropriate method to avoid redundancy and refine the information in an abstract embedding space both effectively and efficiently.

Figure 4: Channel-wise Affinity Attention module (CAA). Specifically, Compact Channel-wise Comparator (CCC) can approximate the similarity matrix between the channels of input feature map. Then Channel Affinity Estimator (CAE) takes the similarity matrix for the calculation of affinity matrix.

We propose our Channel-wise Affinity Attention (CAA) module targeting the channels of high-level point cloud feature maps. As Figure 4 shows, the main structure of the CAA module includes a Compact Channel-wise Comparator (CCC) block, a Channel Affinity Estimator (CAE) block, and a residual connection.

Compact Channel-wise Comparator block. Since the CAA module mainly focuses on channels, it is necessary to reduce the computing cost caused by the complexity in point-space. Moreover, it is hard to select key points in such an abstract high-dimensional space. Given a $d$-dimensional feature map $\mathcal{F}_{N\times d}$, the Compact Channel-wise Comparator (CCC) block simplifies the context in each channel with a shared MLP operating on each channel vector $c_i$ (where $c_i \in \mathbb{R}^N$ and $\mathcal{F}_{N\times d} = [c_1, c_2, \dots, c_d]$), implicitly replacing the $N$ original points with a smaller number $N' = N/ratio$, $ratio > 1$. In contrast to explicitly selecting some points in the abstract embedding space, CCC aims to efficiently reduce the size while retaining sufficient information of each channel:

$$q_i = \mathcal{M}_q(c_i); \quad q_i \in \mathbb{R}^{N'}$$
$$k_i = \mathcal{M}_k(c_i); \quad k_i \in \mathbb{R}^{N'}$$

Specifically, $\mathcal{M}_q(\cdot)$ and $\mathcal{M}_k(\cdot)$ are two MLPs producing the QueryMatrix and KeyMatrix [36]:

$$\mathcal{Q}_{N'\times d} = [q_1, q_2, \dots, q_d]$$
$$\mathcal{K}_{N'\times d} = [k_1, k_2, \dots, k_d]$$

and we apply the product of the transposed QueryMatrix and the KeyMatrix to estimate the corresponding channel-wise SimilarityMatrix:

$$\mathcal{S}_{d\times d} = \mathcal{Q}^T \mathcal{K}$$

where $\mathcal{S}_{i,j}$ approximates the similarity between the $i$-th channel and the $j$-th channel of the given feature map $\mathcal{F}_{N\times d}$.

Channel Affinity Estimator block. Typical self-attention structures calculate the long-range dependencies in spatial data based on inner products, since these values can represent the similarities between items. In contrast, we define the non-similarity between channels and term it Channel Affinity. In our approach, the Channel Affinity Matrix of the feature map $\mathcal{F}_{N\times d}$ can be modeled as:

$$\mathcal{A}_{d\times d} = \mathrm{softmax}\big(\mathrm{expand}_{1\to d}(\max_{d\to 1}(\mathcal{S})) - \mathcal{S}_{d\times d}\big) \tag{15}$$

In particular, we select the maximum similarities along the columns of $\mathcal{S}$, and then expand them to the same size as $\mathcal{S}$. By subtracting the original $\mathcal{S}$ from the expanded matrix, the channels with higher similarities get lower affinities (illustrated in Figure 5(b)). Besides, softmax is added to normalize the values, since $\mathcal{A}_{d\times d}$ is used as the weight matrix for refinement. In this way, channels put higher weights on other distinct channels, thereby avoiding the aggregation of similar or redundant information.
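Equation 15 can be sketched on a small similarity matrix as follows. The reduction axis is not fully pinned down by the text, so this sketch assumes the maximum is taken within each row of the similarity matrix and that softmax is likewise applied row by row; the function name `channel_affinity` is hypothetical.

```python
import math

def channel_affinity(S):
    """Turn a d x d similarity matrix into an affinity (weight) matrix.

    Assumption of this sketch: max and softmax both run along each row.
    """
    A = []
    for row in S:
        m = max(row)
        scores = [m - s for s in row]      # higher similarity -> lower score
        top = max(scores)                  # subtract the max for stability
        exps = [math.exp(v - top) for v in scores]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

# Channel 0 is much more similar to itself than to channel 1,
# so its affinity weight shifts toward the distinct channel 1.
S = [[2.0, 0.0],
     [1.0, 1.0]]
A = channel_affinity(S)
```

Each row of the result sums to one, and within a row the least similar channel receives the largest weight, which is the behavior the text describes as focusing on distinct channels.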

(a) Compact Channel-wise Comparator (CCC).
(b) Channel Affinity Estimator (CAE).
Figure 5: Compact Channel-wise Comparator (CCC) and Channel Affinity Estimator (CAE) blocks.

According to the weight matrix, we can refine each point's features by taking the weighted sum of all channels. We apply another MLP, $\mathcal{M}_v(\cdot)$, to get the ValueMatrix as shown below:

$$\mathcal{V}_{N\times d} = [v_1, v_2, \dots, v_d]$$
$$v_i = \mathcal{M}_v(c_i); \quad v_i \in \mathbb{R}^{N}$$

This refinement is achieved by multiplying $\mathcal{V}_{N\times d}$ by the Channel Affinity Matrix. Additionally, we use a residual connection and learn a weight $\alpha$ to ease block training. The feature map refined by CAA is given below:

$$\mathcal{F}'_{N\times d} = \mathrm{CAA}(\mathcal{F}) = \mathcal{F} + \alpha \cdot \mathcal{V}\mathcal{A} \tag{16}$$

Figure 6: Fine-grained Edge Feature Extractor (FEFE).

3.3 Geometric Features

For regular scattered point clouds, the given information about 3D coordinates is minimal. In our work, we attempt to enrich the geometric features of the point cloud from two aspects: (1) we describe the low-level relations explicitly, and (2) extract the high-level information implicitly.

Explicit geometric features. Here we define the explicit geometric features in low-level space to estimate features with explicit purposes. In geometry, a mesh is a type of well-constructed 3D data format, including faces, edges, as well as vertices. Similarly, we incorporate the estimated faces and edges to expand the low-level features representation of the 3D point clouds. Hence, we propose a Naive Mesh Descriptor (NMD) to enrich the original input data (i.e. 3D coordinates) with estimated face features.

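Since most mesh data is constructed from triangle faces, we also adapt the point cloud to a triangle mesh format. To be specific, we first search for the two nearest neighbors, i.e. knn with $k=2$, in 3D space for point $p_i \in \mathbb{R}^3$, and then we form the triangle face corresponding to $p_i$ with the two neighbors $p_{j1}, p_{j2} \in \mathbb{R}^3$. To explicitly describe the estimated triangle face, six items with exact geometric purposes are involved:

$$\tilde{p}_i = (p_i,\; normal,\; edge_1,\; edge_2,\; length_1,\; length_2) \tag{17}$$

Concretely, with $\tilde{p}_i \in \mathbb{R}^{14}$:

$$\begin{cases}
p_i = (x, y, z) & p_i \in \mathbb{R}^3 \\
normal = edge_1 \times edge_2 & normal \in \mathbb{R}^3 \\
edge_1 = p_{j1} - p_i & edge_1 \in \mathbb{R}^3 \\
edge_2 = p_{j2} - p_i & edge_2 \in \mathbb{R}^3 \\
length_1 = |edge_1| & length_1 \in \mathbb{R}^1 \\
length_2 = |edge_2| & length_2 \in \mathbb{R}^1
\end{cases}$$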
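The descriptor of Equation 17 is simple enough to compute by hand, and the following plain-Python sketch does so for a single point. The function names are hypothetical, a brute-force neighbor search is used purely for brevity, and the normal is left unnormalized because the definition above is just the cross product of the two edges.

```python
import math

def cross(a, b):
    """Cross product of two 3-D vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def naive_mesh_descriptor(points, i):
    """14-D descriptor (p_i, normal, edge1, edge2, length1, length2)."""
    p = points[i]
    # Two nearest neighbors of p_i in 3-D space (knn with k = 2).
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(p, points[j]))
    j1, j2 = others[:2]
    edge1 = [a - b for a, b in zip(points[j1], p)]
    edge2 = [a - b for a, b in zip(points[j2], p)]
    normal = cross(edge1, edge2)
    length1 = math.sqrt(sum(c * c for c in edge1))
    length2 = math.sqrt(sum(c * c for c in edge2))
    return list(p) + normal + edge1 + edge2 + [length1, length2]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
desc = naive_mesh_descriptor(points, 0)
```

For the first point, the two nearest neighbors lie on the x and y axes, so the estimated face normal points along z and the two edge lengths are 1 and 2.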

Implicit geometric features. In contrast to explicit geometry in low-level space, we also expect to capture more implicit information in high-level space. As explained in Section 3.1, the Attentional Feedback Module for EdgeConv (AFME) can extract local prominent features via a max-pooling function in a high-level space. Although the prominent features can encode much geometric information for simple point clouds, more details are needed. Especially for some challenging cases e.g. real objects, complex scenes, or similar shapes etc., more fine-grained features are required for comprehensive feature representation.

Specifically, the edge features $f_\psi(x_i)$ from an AFME are employed as the input of the corresponding Fine-grained Edge Feature Extractor (FEFE, Figure 6). Instead of max-pooling for prominent features, the LFC layer $\mathcal{L}$ of Equation 8 aggregates more details from all neighbors. Besides, CAA helps to refine the features for compact outputs. Therefore, the extracted fine-grained features are formulated as:

$$\tilde{f}_i = \mathrm{FEFE}(f_\psi(x_i)) = \mathrm{CAA}\big(\mathcal{L}(f_\psi(x_i))\big) \tag{18}$$

4 Experiments

Table 1: Classification results (%) on ModelNet40 benchmark. (coords: (x,y,z) coordinates, norm: point normal, voting: multi-votes evaluation strategy, k: $\times 2^{10}$, -: unknown)

method | input type | #points | avg class acc. | overall acc.
------ | ---------- | ------- | -------------- | ------------
ECC [31] | coords | 1k | 83.2 | 87.4
PointNet [26] | coords | 1k | 86.0 | 89.2
SCN [46] | coords | 1k | 87.6 | 90.0
Kd-Net [10] | coords | 1k | - | 90.6
PointCNN [15] | coords | 1k | 88.1 | 92.2
PCNN [2] | coords | 1k | - | 92.3
DensePoint [17] | coords | 1k | - | 92.8
RS-CNN [18] | coords | 1k | - | 92.9
DGCNN [43] | coords | 1k | 90.2 | 92.9
KP-Conv [34] | coords | 1k | - | 92.9
Ours | coords | 1k | 91.0 | 93.8
SO-Net [12] | coords | 2k | 87.3 | 90.9
PointNet++ [27] | coords+norm | 5k | - | 91.9
Spec-GCN [39] | coords+norm | 2k | - | 92.1
SpiderCNN [47] | coords+norm | 5k | - | 92.4
DensePoint [17] | coords+voting | 1k | - | 93.2
SO-Net [12] | coords+norm | 5k | 90.8 | 93.4
DGCNN [43] | coords | 2k | 90.7 | 93.5
RS-CNN [18] | coords+voting | 1k | - | 93.6

In this section, we first provide the implementation and training details, followed by the datasets we use for evaluation. We then analyze our network to establish the effects of different modules. Furthermore, we visualize the outputs, discuss the complexity of our model, and conclude with a comparison against state-of-the-art methods on synthetic and real-world point clouds.

Implementation details. Our proposed network starts with a Naive Mesh Descriptor, which expands the input 3D coordinates into a 14-dimensional geometric vector. Next, the geometric features are passed through four modules to learn high-level features in embedding spaces of different sizes, i.e. 64, 64, 128, and 256. Each module has an Attentional Feedback Module for EdgeConv (AFME, with the number of neighbors k set to 20) for extracting prominent features and a Fine-grained Edge Feature Extractor (FEFE) for local details.

To incorporate the information from different scales, we concatenate the output feature maps of these modules, and a shared MLP with a CAA module further integrates them into a 1024-dimensional feature map. Then we apply max-pooling and average-pooling in parallel over all channels to obtain a global vector, from which three additional fully connected layers (with 512, 256, and c outputs, where c is the number of categories) regress the confidence scores for all possible categories. In the end, we employ cross-entropy between predictions and ground-truth labels as our loss function.
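The global vector described above can be sketched as a per-channel max and mean over all points, concatenated. This is a toy illustration with a hypothetical function name, and it assumes the feature map is stored as a list of points, each holding one value per channel.

```python
def global_vector(feature_map):
    """Concatenate per-channel max pooling and average pooling.

    feature_map: N points, each a list of C channel values.
    Returns a vector of length 2 * C.
    """
    channels = list(zip(*feature_map))            # C tuples of N values
    max_pool = [max(c) for c in channels]
    avg_pool = [sum(c) / len(c) for c in channels]
    return max_pool + avg_pool

# Two points with two channels each.
toy_map = [[1.0, 4.0],
           [3.0, 2.0]]
vec = global_vector(toy_map)
```

Under this reading, a feature map with C channels produces a 2C-dimensional global vector that feeds the fully connected layers.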

Table 2: Ablation studies on different Naive Mesh Descriptor $\tilde{p}_i$ forms on ModelNet40 classification accuracy (%). ($p_i$: (x,y,z), n: normal, e: edge, l: |edge|. Please refer to Equation 17 for details.)

model | Naive Mesh Descriptor | length | overall acc.
----- | --------------------- | ------ | ------------
1 | $\tilde{p}_i=(p_i)$ | 3 | 93.4
2 | $\tilde{p}_i=(p_i, n, l_1, l_2)$ | 8 | 93.5
3 | $\tilde{p}_i=(p_i, e_1, e_2, l_1, l_2)$ | 11 | 93.5
4 | $\tilde{p}_i=(p_i, n, e_1, e_2)$ | 12 | 93.7
5 | $\tilde{p}_i=(p_i, n, e_1, e_2, l_1, l_2)$ | 14 | 93.8

Training. We apply Stochastic Gradient Descent (SGD) with the momentum of 0.9 as the optimizer for training, and its initial learning rate of 0.1 decreases to 0.001 by cosine annealing [19]. The batch size is set to 32, and the number of training epochs is 300. Besides, we augment the training data with random scaling and translation as in [43], while there is no pre or post-processing performed during testing.
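For reference, one common form of cosine annealing [19] without restarts is sketched below with the learning rates and epoch count quoted above. Whether the schedule is stepped per epoch or per iteration is not stated, so the per-epoch update and the single 300-epoch cycle are assumptions, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=300, lr_max=0.1, lr_min=0.001):
    """Learning rate at a given epoch for a single cosine cycle."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# Learning rate at the start, the midpoint, and the end of training.
schedule = [cosine_annealing_lr(e) for e in (0, 150, 300)]
```

The rate starts at 0.1, passes through the midpoint of the two bounds halfway through training, and ends at 0.001.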

Datasets. We show the performance of the proposed network on two classification datasets: the classical ModelNet40 [45], which contains synthetic object point clouds, and the recently introduced ScanObjectNN [35], composed of real-world object point clouds.

  • ModelNet40. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular because of its various categories, clean shapes, well-constructed dataset, etc. To be specific, the original ModelNet40 consists of 12,311 CAD-generated meshes in 40 categories, of which 9,843 are used for training while the remaining 2,468 are reserved for testing. Moreover, the corresponding point cloud data points are uniformly sampled from the mesh surfaces, and then further preprocessed by moving to the origin and scaling into a unit sphere. For our experiments, we only input the (x,y,z) coordinates of 1024 points for each 3D point cloud.

  • ScanObjectNN. To further prove the effectiveness and robustness of our classification network, we conduct experiments on ScanObjectNN, a newly published real-world object dataset that has about 15,000 objects in 15 categories. Although it has fewer categories than ModelNet40, it is more practically challenging than its synthetic counterpart due to the background, missing parts, and various deformations.
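The ModelNet40 preprocessing mentioned above (moving to the origin and scaling into a unit sphere) is commonly done as in the sketch below. Centering on the centroid and dividing by the largest distance from it are assumptions about the exact convention, the function name is hypothetical, and the input is assumed to contain at least two distinct points.

```python
import math

def normalize_to_unit_sphere(points):
    """Center a point cloud on its centroid and scale it into a unit sphere."""
    n = len(points)
    centroid = [sum(c) / n for c in zip(*points)]
    centered = [[a - m for a, m in zip(p, centroid)] for p in points]
    # Largest distance from the centroid becomes the unit radius.
    radius = max(math.sqrt(sum(a * a for a in p)) for p in centered)
    return [[a / radius for a in p] for p in centered]

cloud = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
normalized = normalize_to_unit_sphere(cloud)
```

After this step every point lies within distance 1 of the origin, which keeps the coordinate scale comparable across shapes.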

4.1 Ablation studies

To verify the functions and effectiveness of different parts in our network, here we conduct two ablation studies about the proposed modules and the contents of Naive Mesh Descriptor, respectively. We investigate the same proposed network on the ModelNet40 dataset.

Table 3: Ablation Studies for different modules of our classification network on ModelNet40 (%). (FME: Feedback Module for EdgeConv, CAA: Channel-wise Affinity Attention model, FME+CAA: AFME in Section 3.1, FEFE: Fine-grained Edge Feature Extractor, NMD: Naive Mesh Descriptor.)
model FME CAA FEFE NMD overall acc.
baseline 92.6
1 92.9
2 93.0
3 93.2
4 93.4
5 93.3
6 93.1
7 93.4
8 93.8

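Table 4: Classification results (%) on ScanObjectNN benchmark.

method | overall acc. | avg class acc. | bag | bin | box | cabinet | chair | desk | display | door | shelf | table | bed | pillow | sink | sofa | toilet
------ | ------------ | -------------- | --- | --- | --- | ------- | ----- | ---- | ------- | ---- | ----- | ----- | --- | ------ | ---- | ---- | ------
# shapes | - | - | 298 | 794 | 406 | 1344 | 1585 | 592 | 678 | 892 | 1084 | 922 | 564 | 405 | 469 | 1058 | 325
3DmFV [3] | 63 | 58.1 | 39.8 | 62.8 | 15.0 | 65.1 | 84.4 | 36.0 | 62.3 | 85.2 | 60.6 | 66.7 | 51.8 | 61.9 | 46.7 | 72.4 | 61.2
PointNet [26] | 68.2 | 63.4 | 36.1 | 69.8 | 10.5 | 62.6 | 89.0 | 50.0 | 73.0 | 93.8 | 72.6 | 67.8 | 61.8 | 67.6 | 64.2 | 76.7 | 55.3
SpiderCNN [47] | 73.7 | 69.8 | 43.4 | 75.9 | 12.8 | 74.2 | 89.0 | 65.3 | 74.5 | 91.4 | 78.0 | 65.9 | 69.1 | 80.0 | 65.8 | 90.5 | 70.6
PointNet++ [27] | 77.9 | 75.4 | 49.4 | 84.4 | 31.6 | 77.4 | 91.3 | 74.0 | 79.4 | 85.2 | 72.6 | 72.6 | 75.5 | 81.0 | 80.8 | 90.5 | 85.9
DGCNN [43] | 78.1 | 73.6 | 49.4 | 82.4 | 33.1 | 83.9 | 91.8 | 63.3 | 77.0 | 89.0 | 79.3 | 77.4 | 64.5 | 77.1 | 75.0 | 91.4 | 69.4
PointCNN [15] | 78.5 | 75.1 | 57.8 | 82.9 | 33.1 | 83.6 | 92.6 | 65.3 | 78.4 | 84.8 | 84.2 | 67.4 | 80.0 | 80.0 | 72.5 | 91.9 | 71.8
Ours | 80.5 | 77.8 | 59.0 | 84.4 | 44.4 | 78.2 | 92.1 | 66 | 91.2 | 91.0 | 86.7 | 70.4 | 82.7 | 78.1 | 72.5 | 92.4 | 77.6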

Effects of different modules. Table 3 shows the results of the ablation study on the different modules of our network. The feedback module works well with EdgeConv, and the performance of model 5 shows a further enhancement with the CAA module applied (i.e. AFME). Besides, the results of models 3, 4, and 5 show that the network benefits from implicit and explicit geometric features. However, it is worth noting that adding low-level geometric information (NMD) alone may not improve performance, since redundant features may cause overfitting (models 5 and 6). The model benefits further once we augment the geometric information at both low and high levels.

Naive Mesh Descriptor. Although the idea of adding explicit geometric features is simple and intuitive, by comparing models 7 and 8 in Table 3, we can find a 0.4% improvement. To illustrate further, we present another ablation study to investigate the best representation of the Naive Mesh Descriptor. Table 2 shows the results of various possible forms of the NMD. According to the experiments, we conclude that the formation of the NMD for model 5 works better since these terms can comprehensively represent the low-level geometric details of the estimated triangle face, including the vertex, face normal, and edges etc.

Figure 7: Examples of the features learned by AFME and FEFE modules in shallow and deep layers of our network on ModelNet40. Shallow layers focus on edges/corners while deep layers cover more semantically meaningful parts. Besides, FEFE and AFME can capture complementary features, which are crucial for comprehensive point clouds representations.

Visualization and complexity. Figure 7 visualizes the features learned by the AFME and FEFE modules in different layers of our network on ModelNet40. All examples show a typical property of CNNs: the shallow layers respond more strongly to simpler features, e.g. edges and corners, while deep layers connect those simpler features into more semantically specific parts. As stated before, AFME mainly extracts prominent features while FEFE assists in capturing missing details. From the figure we can observe that FEFE complements AFME as expected.

Although we use operations similar to those of competing methods, e.g. FC layers and the knn algorithm, we manage to reduce the complexity by sharing weights, reducing dimensions, etc. The inference time of our model running on a GeForce RTX 2080 Ti is about 17.5ms. Compared with other state-of-the-art methods under the same test conditions (please refer to our supplementary material for more experimental results), our approach offers a relatively good compromise between accuracy and model complexity. We expect to further optimize the network for real-time applications.

4.2 Classification Performance

Results on synthetic point clouds. Table 1 shows the quantitative results on the synthetic ModelNet40 classification benchmark. Our network (overall accuracy: 93.8%, average class accuracy: 91.0%) exceeds state-of-the-art methods when compared under the same input, i.e. 1k coordinates only. It is worth mentioning that our approach is even better than some methods that use extra input points, e.g. DGCNN [43] with 2k inputs achieves an overall accuracy of 93.5% and an average class accuracy of 90.7%. Similarly, our algorithm outperforms SO-Net [12], which uses more information (5k input points with normals) and achieves an overall accuracy of 93.4% and an average class accuracy of 90.8%. We also obtain a higher score than RS-CNN [18], which relies on post-processing in the form of a ten-vote evaluation during testing. In terms of the network architecture itself, these results indicate that our approach is promising and effective for classification.
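
The two metrics reported above weight the test set differently: overall accuracy counts every sample equally, whereas average class accuracy first computes the accuracy of each class and then averages over classes, so rare classes matter as much as frequent ones. The sketch below shows the standard definitions on a toy example; the function name `overall_and_class_accuracy` and the toy labels are illustrative and are not taken from the experiments.

```python
from collections import defaultdict


def overall_and_class_accuracy(labels, predictions):
    """Return (overall accuracy, average class accuracy, per-class accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        if y == p:
            correct[y] += 1
    # Overall accuracy: fraction of all samples classified correctly.
    overall = sum(correct.values()) / len(labels)
    # Per-class accuracy, then an unweighted mean over the classes present.
    per_class = {c: correct[c] / total[c] for c in total}
    mean_class = sum(per_class.values()) / len(per_class)
    return overall, mean_class, per_class


# Toy example: class 0 has four samples, class 1 has two.
labels = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 1, 0]
overall, mean_class, per_class = overall_and_class_accuracy(labels, predictions)
# overall is 5/6, while mean_class is (1.0 + 0.5) / 2 = 0.75
```

In the toy example the single mistake falls in the smaller class, so the average class accuracy drops further than the overall accuracy, which is why both numbers are informative on imbalanced benchmarks.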

Results on real-world point clouds. For real-world classification, we use the same network architecture and training strategy, with 1k 3D coordinates as input. For a fair comparison with state-of-the-art methods, we conduct the classification experiment on the most challenging variant of the dataset (PB_T50_RS, the hardest case of the ScanObjectNN dataset as in [35]). Table 4 presents the accuracies of competing methods on the real-world ScanObjectNN dataset. Our network, with an overall accuracy of 80.5% and an average class accuracy of 77.8%, significantly improves the classification accuracy on this benchmark. We perform better than other methods in 7 out of 15 categories, and for hard cases such as bag or display, we increase the accuracy by more than 10%. Furthermore, our approach performs even better than DGCNN [43] and PointNet++ [27] combined with the background-aware (BGA) network [35], which is designed explicitly for real-object point clouds.

Although the ScanObjectNN dataset contains hard cases for point cloud classification, our method demonstrates its effectiveness and robustness (see our supplementary material for more experimental results). As stated before, point cloud analysis aims to solve practical problems. The strong performance on a real-object dataset is a clear affirmation of our work.

5 Conclusion

In this paper, we propose a new CNN-based module called the Attentional Feedback Module, targeting some remaining problems of point cloud analysis: the feedback-like modules for edge features automatically assist in learning a better point cloud representation, together with the Channel-wise Affinity Attention module that focuses on distinct channels. Besides, we incorporate more explicit geometric information using the Naive Mesh Descriptor and implicit geometric information through the Fine-grained Edge Feature Extractor. To compare our method with other state-of-the-art networks, we conduct experiments on both synthetic and real-world datasets. The results show the effectiveness and robustness of our approach.

References

  • [1] Saeed Anwar and Nick Barnes. Real image denoising with feature attention. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [2] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018.
  • [3] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. 3dmfv: Three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters, 3(4):3145–3152, 2018.
  • [4] François Blais et al. Review of 20 years of range sensor development. Journal of electronic imaging, 13(1):231–243, 2004.
  • [5] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
  • [6] Can Chen, Luca Zanotti Fragonara, and Antonios Tsourdos. Gapnet: Graph attention based point neural network for exploiting local feature of point cloud. arXiv preprint arXiv:1905.08705, 2019.
  • [7] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
  • [8] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [9] Michel Jaboyedoff, Thierry Oppikofer, Antonio Abellán, Marc-Henri Derron, Alex Loye, Richard Metzger, and Andrea Pedrazzini. Use of lidar in landslide investigations: a review. Natural hazards, 61(1):5–28, 2012.
  • [10] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.
  • [11] Marcel Körtgen, Gil-Joo Park, Marcin Novotni, and Reinhard Klein. 3d shape matching with 3d shape contexts. In The 7th central European seminar on computer graphics, volume 3, pages 5–17. Budmerice, 2003.
  • [12] Jiaxin Li, Ben M Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9397–9406, 2018.
  • [13] Ruihui Li, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. Pu-gan: A point cloud upsampling adversarial network. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [14] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2018.
  • [15] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
  • [16] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8778–8785, 2019.
  • [17] Yongcheng Liu, Bin Fan, Gaofeng Meng, Jiwen Lu, Shiming Xiang, and Chunhong Pan. Densepoint: Learning densely contextual representation for efficient point cloud processing. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [18] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.
  • [19] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [20] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
  • [21] Quentin Mérigot, Maks Ovsjanikov, and Leonidas J Guibas. Voronoi-based curvature and feature estimation from point clouds. IEEE Transactions on Visualization and Computer Graphics, 17(6):743–756, 2010.
  • [22] Niloy J Mitra, Natasha Gelfand, Helmut Pottmann, and Leonidas Guibas. Registration of point cloud data from a geometric optimization perspective. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pages 22–31. ACM, 2004.
  • [23] Niloy J Mitra and An Nguyen. Estimating surface normals in noisy point cloud data. In Proceedings of the nineteenth annual symposium on Computational geometry, pages 322–328. ACM, 2003.
  • [24] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3316–3324, 2015.
  • [25] Anshul Paigwar, Ozgur Erkent, Christian Wolf, and Christian Laugier. Attentional pointnet for 3d-object detection in point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [26] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [27] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
  • [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [29] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (fpfh) for 3d registration. In 2009 IEEE International Conference on Robotics and Automation, pages 3212–3217. IEEE, 2009.
  • [30] Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Efficient ransac for point-cloud shape detection. In Computer graphics forum, volume 26, pages 214–226. Wiley Online Library, 2007.
  • [31] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017.
  • [32] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
  • [33] Yongbin Sun, Yue Wang, Ziwei Liu, Joshua E Siegel, and Sanjay E Sarma. Pointgrow: Autoregressively learned point cloud generation with self-attention. arXiv preprint arXiv:1810.05591, 2018.
  • [34] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [35] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [37] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [38] George Vosselman, Sander Dijkman, et al. 3d building model reconstruction from point clouds and ground plans. International archives of photogrammetry remote sensing and spatial information sciences, 34(3/W4):37–44, 2001.
  • [39] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–66, 2018.
  • [40] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10296–10305, 2019.
  • [41] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
  • [42] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [43] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019.
  • [44] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [45] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [46] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2018.
  • [47] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
  • [48] Wenxiao Zhang and Chunxia Xiao. Pcan: 3d attention map learning using contextual information for point cloud based retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12436–12445, 2019.
  • [49] Kang Zhiheng and Li Ning. Pyramnet: Point cloud pyramid attention network and graph embedding module for classification and segmentation. arXiv preprint arXiv:1906.03299, 2019.