Geometric Feedback Network for Point Cloud Classification
Abstract
As the basic task of point cloud learning, classification is fundamental but always challenging. To address some unsolved problems of existing methods, we propose a network designed as a feedback mechanism, a procedure that modifies the output in response to the output itself, to comprehensively capture the local features of 3D point clouds. We also enrich the explicit and implicit geometric information of point clouds in low-level 3D space and high-level feature space, respectively. By applying an attention module based on channel affinity, which focuses on distinct channels, the learned feature map of our network can effectively avoid redundancy. The performance on synthetic and real-world datasets demonstrates the superiority and applicability of our network. Compared with other state-of-the-art methods, our approach balances accuracy and efficiency.
1 Introduction
Point clouds are one of the fundamental representations of 3D data and are widely used in both academia and industry, driven by the development of 3D sensing technology and its applications. Generally, 3D point clouds can be collected by 3D scanners [4] that rely on physical touch or non-contact measurements, e.g. light, sound, or LiDAR. In particular, LiDAR scanners [9] are in service in many areas, including agriculture, biology, and robotics. Because of these contributions, point cloud analysis attracts much interest for further investigation.
Previously, traditional algorithms [30, 22, 29, 38] that operate on 3D data incorporated estimated geometric information and reconstructed models. With the help of deep learning, recent works on 3D data focus on data-driven approaches via Convolutional Neural Networks (CNNs). Classical works can be broadly categorized as: multi-view images with 2D CNNs (e.g. MVCNN [32]), volumetric/mesh data with 3D CNNs (e.g. VoxNet [20]), and 3D point clouds with multi-layer perceptrons (MLPs) (e.g. PointNet [26]).
Currently, many works are exploring different methods to improve the processing of 3D point clouds, but some issues remain unsolved: (1) How can we force the network to automatically learn a better representation from the abstract high-level embedding space? (2) How do we refine the output features in order to focus on the crucial information? (3) Besides the regular features, can we learn from more geometric clues for a comprehensive analysis?
To investigate possible answers to the above questions, we present a novel attentional feedback structure for point cloud learning that incorporates geometric context. As supported by substantial biological evidence, feedback mechanisms are very helpful for visual tasks, because feedback paths in our brain, especially in the visual cortex, consist of more neurons than the forward paths. Feedback mechanisms have also been successfully used to build stable and responsive systems in industry. Some early works incorporated this mechanism into CNNs to deal with computer vision problems such as 2D visual segmentation [5] or 3D hand pose estimation [24]. For point clouds, however, PU-GAN [13] used a feedback-like unit only to some extent, and only for feature expansion inside its generator module. Our primary motivation for investigating feedback is to automatically refine the output feature map by comparing the difference between the input and the corresponding feedback signal. Using this approach, we enable the network to generate a better learning output.
Regarding high-level feature representation for 3D point clouds, attention, another widely applied mechanism, can help the network put more emphasis on useful information [36]. Attention modules are used in many 2D visual problems (e.g. image segmentation [42, 8, 7], image denoising [1]). For 3D point cloud analysis, we design a Channel-wise Affinity Attention module for feature map enhancement based on the affinity between channels.
To deal with the unorderedness of regular point cloud data, PointNet [26] proposed using symmetric functions to aggregate features, and most subsequent works apply max-pooling to extract prominent features. Although prominent features are representative, they lack some details, especially in local areas, and are therefore insufficient for precise classification. To address this problem, we propose a simple but effective way to learn fine-grained features via a shared fully connected (FC) layer on each local neighborhood. Although recent works [27, 18, 41, 47, 12] show that CNN approaches can benefit from more geometric information, such information can also hurt performance when redundant or useless features are incorporated. To maximize its advantage, we carefully form a low-level descriptor with explicit physical meaning to enrich the geometric information.
The contributions of our work can be summarized as:

• We design a feedback CNN mechanism for local prominent feature learning on point clouds.
• We introduce a Channel-wise Affinity Attention module to better refine high-level features of point clouds.
• We propose an intuitive but effective structure to extract local fine-grained features as a complement. We also show that our estimated mesh descriptor can significantly improve the performance of the network.
• We present experimental results showing that our proposed network outperforms state-of-the-art methods on both synthetic and real-world 3D point cloud classification benchmarks.
2 Related Work
Estimating geometric relations. Although 3D scattered point cloud data has many advantages, its main drawback is the lack of geometric information. To acquire more underlying knowledge of point clouds, conventional methods [23, 21, 38] estimated the geometry of a point cloud (e.g. faces, normals, curvature) and proposed many hand-crafted features for recognition and matching (e.g. shape context [11], point histograms [29]). Besides, recent works [27, 18, 43, 46] with CNNs perform better thanks to the permutation invariance of low-level geometry. To serve the needs of modern methods, we expand low-level geometric information from the given 3D coordinates and integrate it as an explicit geometric descriptor for network processing.
Learning local features. PointNet [26] passes point cloud data through MLPs to obtain a high-level feature representation of every single point, and successfully solved the unordered problem of point cloud data with a symmetric function. Owing to this effectiveness, later works [27, 15, 43, 18] also adopt MLP-based operations for point cloud processing. Meanwhile, researchers realize that local features are promising because they contribute additional characteristics to global features. Although the points are unordered in point cloud data, we may group points based on various metrics. Generally, one approach selects seed points as centroids, and then applies a query algorithm (e.g. Ball Query in [27]) based on 3D Euclidean distance to group points into local clusters. After extracting local features, the network may further process the centroids.
Another track is to find each point’s neighbors in embedding space based on N-dimensional Euclidean distance [43] and then group each point’s neighbors in the form of high-dimensional vectors. In contrast to the previous type, this method can avoid sparsity and update dynamically in different feature dimensions. In terms of feature aggregation, max-pooling [26] is widely employed since it resolves the issue of unorderedness and gathers information sufficiently. Despite these benefits, the current max-pooling approach has weak points: it may lose local details or introduce bias. To overcome these problems, we add complementary local fine-grained features and feedback structures to reduce possible bias.
Attention mechanism for CNN. The idea of attention has been successfully used in many areas of Artificial Intelligence (AI). As with human beings, the computational resources of a machine are limited, so we need to focus on important aspects. Previously, Vaswani et al. [36] proposed different types of attention mechanisms for neural machine translation. Subsequently, attention mechanisms were incorporated into visual tasks; for example, Wang et al. [42] extended the idea of self-attention to the spatial domain for computer vision problems. Also, SENet [8] credits winning the ImageNet [28] challenge to its channel-wise attention module. Other works [44, 7, 14] derive benefits from both the spatial and channel domains of 2D images.
In terms of 3D point clouds, attention modules contribute to point cloud detection [25], generation [33], segmentation [48, 16, 49], etc. However, limited work has been done on well-designed attention mechanisms targeting 3D point cloud classification. On this front, Xie et al. [46] utilized a spatial self-attention module for the shape context [11] of point clouds. Subsequent works [40, 6] also applied the Graph Attention [37] module to the graph features constructed on point clouds. Unlike existing methods, we try to enhance the high-level representation of point clouds by capturing the long-range dependencies along their channels.
3 Approach
As stated in Section 1, there are some unsolved problems of the existing methods. To tackle the challenges, we start from classical $MLP$ mapping [26] and edge convolution [43]. We apply the basic $MLP$ operation [26] i.e. a 1$\times $1 convolution on the point cloud feature map, usually followed by a batch normalization layer and an activation function:
$$\mathcal{M}(\cdot ):=\tau \left(\bm{B}\bm{N}\left({c}_{1\times 1}(\cdot )\right)\right),$$  (1) 
where $\mathcal{M}$ is the $MLP$, $\tau $ is the activation function, $\bm{B}\bm{N}$ is batch normalization, and $c$ is convolution, whose subscript denotes the filter size. Moreover, the edge features ${f}_{\psi}$ defined in [43] are given by:
$${f}_{\psi}({x}_{i})=({x}_{i},{x}_{j}-{x}_{i}),\quad {x}_{i}\in {\mathbb{R}}^{d};\ \forall {x}_{j}\in Neighbors({x}_{i})$$  (2) 
where ${x}_{i}$ is the centroid of a local area in $d$-dimensional feature space, and the entire feature map can be formed as ${\mathcal{X}}_{N\times d}=[x_{1}{}^{T},x_{2}{}^{T},\mathrm{\dots},x_{N}{}^{T}]$. The ${x}_{j}$’s are the neighbors found by the $k$-nearest neighbors ($knn$) algorithm.
The aim of $EdgeConv$, $\mathcal{E}$, is to construct radial local graphs (i.e. edge features ${f}_{\psi}$) consisting of edges pointing from the neighbors to the centroids, which are then mapped by a shared $MLP$ in feature space:
$$\mathcal{E}(\cdot ):=\mathcal{M}({f}_{\psi}(\cdot ))$$  (3) 
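Equations 1 to 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: batch normalization is omitted, ReLU is assumed for the activation $\tau$, a point is assumed not to count as its own neighbor, and the function names `knn_indices`, `edge_features`, and `shared_mlp` are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors of every row of x (N, d), excluding the row itself."""
    diff = x[:, None, :] - x[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)           # pairwise squared distances, (N, N)
    np.fill_diagonal(dist2, np.inf)              # assumption: a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]      # (N, k)

def edge_features(x, k):
    """Eq. (2): pair each centroid x_i with the offsets x_j - x_i; result is (N, k, 2d)."""
    idx = knn_indices(x, k)
    neighbors = x[idx]                               # (N, k, d)
    centers = np.repeat(x[:, None, :], k, axis=1)    # (N, k, d)
    return np.concatenate([centers, neighbors - centers], axis=-1)

def shared_mlp(f, weight, bias):
    """Eq. (1) without batch normalization: a 1x1 convolution followed by ReLU (assumed)."""
    return np.maximum(f @ weight + bias, 0.0)
```

Under these assumptions, $EdgeConv$ in Equation 3 corresponds to `shared_mlp(edge_features(x, k), weight, bias)` with a weight of shape `(2d, d')`, shared across all points and neighbors.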
Unlike these approaches, we attempt to incorporate more geometric features at different levels. As illustrated in Figure 2, our network has a series of modules, each consisting of two main parts: one extracting prominent features and the other learning fine-grained features. Our mesh descriptor explicitly expands geometric information in low-level space to meet the demand of comprehensive learning. In this section, we introduce the critical modules in detail and mathematically formulate the operations.
3.1 Attentional Feedback Module for EdgeConv
The typical feedback mechanism aims to accurately control a process by monitoring its actual output and feeding the error signal back to force the system to generate the desired output. To be more specific, the output is passed through a feedback path as a feedback signal, and then the forward path can use such feedback signals to both adjust and control the system. By forming such a closed loop, the system can reduce the error, improve stability, and enhance robustness. Inspired by this, we propose the Attentional Feedback Module for EdgeConv ($AFME$) illustrated in Figure 3 for edge-based prominent features.
Forward path. Here we employ $EdgeConv$ as the forward path of our $AFME$, denoted as $\mathbf{\Phi}$, since this operation can explicitly capture both global shape structure and local neighborhood information. With Equations 2 and 3, the forward path can be formulated as:
$$\mathbf{\Phi}({\varphi}_{i},{x}_{i})=\mathcal{M}({f}_{\psi}({x}_{i}))=\mathcal{M}({x}_{i},{x}_{j}-{x}_{i})$$  (4) 
And the output of forward path is:
$${f}_{{\mathrm{\Phi}}_{i}}=\mathbf{\Phi}({\varphi}_{i},{x}_{i});{f}_{{\mathrm{\Phi}}_{i}}\in {\mathbb{R}}^{{d}^{\prime}\times k}$$  (5) 
Feedback and error signals. Following our forward path $\mathbf{\Phi}$, an ideal output feature map should fully encode global and local details. On the other hand, if the output ${f}_{{\mathrm{\Phi}}_{i}}$ is indeed informative, we may restore the original input ${x}_{i}$ from it. Suppose our feedback path ($\mathbf{\Upsilon}$) aims to restore the input, then $\mathbf{\Upsilon}$’s output, $x_{i}{}^{\prime}$, can be termed as a $FeedbackSignal$:
$$\begin{aligned}x_{i}{}^{\prime}&=\mathbf{\Upsilon}({\upsilon}_{i},{f}_{{\mathrm{\Phi}}_{i}})\\ &=\mathbf{\Upsilon}({\upsilon}_{i},\mathbf{\Phi}({\varphi}_{i},{x}_{i}));\quad x_{i}{}^{\prime}\in {\mathbb{R}}^{d}\end{aligned}$$  (6) 
Further, we take the difference between the original and restored inputs, $\mathrm{\Delta}{x}_{i}$, to formulate corresponding $ErrorSignal$:
$$\begin{aligned}\mathrm{\Delta}{x}_{i}&=x_{i}{}^{\prime}-{x}_{i}\\ &=\mathbf{\Upsilon}({\upsilon}_{i},\mathbf{\Phi}({\varphi}_{i},{x}_{i}))-{x}_{i};\quad \mathrm{\Delta}{x}_{i}\in {\mathbb{R}}^{d}\end{aligned}$$  (7) 
Feedback path. Since we attempt to restore the input from ${f}_{{\mathrm{\Phi}}_{i}}\in {\mathbb{R}}^{{d}^{\prime}\times k}$, the feedback path needs to simulate a reverse process of the forward path (i.e. $EdgeConv$). In our proposal, we apply a shared Local Fully Connected ($LFC$) layer as the feedback path $\mathbf{\Upsilon}$ of our module:
$$\mathcal{L}(\cdot ):=\tau \left(\bm{B}\bm{N}\left({c}_{1\times k}(\cdot )\right)\right)$$  (8) 
where $\mathcal{L}$ is the $LFC$. Mathematically, $LFC$ is a special case of a shared $MLP$ with kernel size $[1,k]$, by which the $k$ neighbors in feature space can be fully connected. In contrast to $EdgeConv$ expanding a center point to local neighbors in the new embedding space, the shared $LFC$ layer can pull the neighbors back to the previous space $({d}^{\prime}\to d)$ and aggregate at a center point $(k\to 1)$ via learnable weights. In general, the feedback path follows:
$$\mathbf{\Upsilon}({\upsilon}_{i},{f}_{{\mathrm{\Phi}}_{i}})=\mathcal{L}({f}_{{\mathrm{\Phi}}_{i}})$$  (9) 
Based on Equations 5, 6 and 7, we can rewrite the output of feedback path, i.e. $FeedbackSignal$ as:
$$x_{i}{}^{\prime}=\mathcal{L}(\mathbf{\Phi}({\varphi}_{i},{x}_{i}));x_{i}{}^{\prime}\in {\mathbb{R}}^{d}$$  (10) 
and $ErrorSignal$ as:
$$\mathrm{\Delta}{x}_{i}=\mathcal{L}(\mathbf{\Phi}({\varphi}_{i},{x}_{i}))-{x}_{i};\mathrm{\Delta}{x}_{i}\in {\mathbb{R}}^{d}$$  (11) 
Correction path. Finally, the module passes the $ErrorSignal$ through a correction path ($\mathbf{\Gamma}$), which has the same structure as the forward path:
$$\mathbf{\Gamma}({\gamma}_{i},\mathrm{\Delta}{x}_{i})=\mathcal{E}(\mathrm{\Delta}{x}_{i})=\mathcal{M}({f}_{\psi}(\mathrm{\Delta}{x}_{i}))$$  (12) 
Therefore, the output features of the correction path can be formed as:
$${f}_{{\mathrm{\Gamma}}_{i}}=\mathbf{\Gamma}({\gamma}_{i},\mathrm{\Delta}{x}_{i});{f}_{{\mathrm{\Gamma}}_{i}}\in {\mathbb{R}}^{{d}^{\prime}\times k}$$  (13) 
To form the feedback loop, we take the output of the correction path as the correction term for the original output of the forward path. After that, we apply max-pooling over the local area to obtain a compact feature map. Moreover, the Channel-wise Affinity Attention ($CAA$, see Section 3.2 for details) module can further refine the final feature representation.
Finally, the operation of Attentional Feedback Module for EdgeConv ($AFME$) can be summarized as:
$$\begin{aligned}{f}_{i}&=\bm{A}\bm{F}\bm{M}\bm{E}({x}_{i})\\ &=\bm{C}\bm{A}\bm{A}\left(\underset{\{k\}}{\mathrm{max}}({f}_{{\mathrm{\Phi}}_{i}}+{f}_{{\mathrm{\Gamma}}_{i}})\right);\quad {f}_{i}\in {\mathbb{R}}^{{d}^{\prime}}\end{aligned}$$  (14) 
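The data flow of Equations 4 to 14 can be traced with a short NumPy sketch. It is only an illustration under stated assumptions: batch normalization and the final $CAA$ step are left out, ReLU stands in for $\tau$, the correction path reuses the neighbor indices of the forward path, and the names `afme_without_caa`, `w_fwd`, `w_fb`, and `w_corr` are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def edge_features(x, idx):
    """Eq. (2): (x_i, x_j - x_i) for the neighbors listed in idx (N, k); result is (N, k, 2d)."""
    centers = np.repeat(x[:, None, :], idx.shape[1], axis=1)
    return np.concatenate([centers, x[idx] - centers], axis=-1)

def afme_without_caa(x, idx, w_fwd, w_fb, w_corr):
    """Forward, feedback, error, and correction paths, then max over the k neighbors."""
    f_phi = relu(edge_features(x, idx) @ w_fwd)                 # Eq. (5): (N, k, d')
    # LFC, Eq. (8): a 1 x k kernel mixes all k neighbors and d' channels into d outputs.
    x_restored = relu(np.einsum("nkc,kcd->nd", f_phi, w_fb))    # Eq. (10): (N, d)
    delta_x = x_restored - x                                    # Eq. (11): error signal
    # Assumption: the correction path reuses the forward path's neighbor indices.
    f_gamma = relu(edge_features(delta_x, idx) @ w_corr)        # Eq. (13): (N, k, d')
    return np.max(f_phi + f_gamma, axis=1)                      # Eq. (14) before CAA: (N, d')
```

Here `w_fwd` and `w_corr` have shape `(2d, d')` and `w_fb` has shape `(k, d', d)`. When `w_corr` is all zeros, the correction term vanishes and the result reduces to plain max-pooled $EdgeConv$ features, which makes the role of the feedback loop easy to isolate.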
3.2 Channelwise Affinity Attention Module
As mentioned in Section 2, most attention designs for point clouds operate in point space, but the effects are not apparent. Instead, we prefer distributing attention weights along channels. Inspired by the space-time non-local block [42], we can calculate the long-range dependencies without being concerned by the unorderedness of point cloud data. However, the corresponding calculations also have a high computational cost. Therefore, we ought to find an appropriate method to avoid redundancy and refine the information in an abstract embedding space effectively and efficiently.
We propose our Channel-wise Affinity Attention ($CAA$) module targeting the channels of high-level point cloud feature maps. As Figure 4 shows, the main structure of the $CAA$ module includes a Compact Channel-wise Comparator ($CCC$) block, a Channel Affinity Estimator ($CAE$) block, and a residual connection.
Compact Channel-wise Comparator block. Since the $CAA$ module mainly focuses on channels, it is necessary to reduce the computing cost caused by the complexity in point space. As we claimed above, it is hard to select the key points in such an abstract high-dimensional space. For a given $d$-dimensional feature map ${\mathcal{F}}_{N\times d}$, the Compact Channel-wise Comparator ($CCC$) block simplifies the context in each channel by a shared $MLP$ operating on each channel vector ${c}_{i}$ (where ${c}_{i}\in {\mathbb{R}}^{N}$ and ${\mathcal{F}}_{N\times d}=[{c}_{1},{c}_{2},\mathrm{\dots},{c}_{d}]$) to implicitly replace the $N$ original points with a smaller number ${N}^{\prime}=N/ratio$, $ratio>1$. In contrast to explicitly selecting some points in abstract embedding space, $CCC$ aims to efficiently reduce the size but sufficiently retain the information of each channel:
$${q}_{i}={\mathcal{M}}_{q}({c}_{i});{q}_{i}\in {\mathbb{R}}^{{N}^{\prime}}$$ 
$${k}_{i}={\mathcal{M}}_{k}({c}_{i});{k}_{i}\in {\mathbb{R}}^{{N}^{\prime}}$$ 
Specifically, ${\mathcal{M}}_{q}(\cdot )$ and ${\mathcal{M}}_{k}(\cdot )$ are two $MLP$s operating for $QueryMatrix$ and $KeyMatrix$ [36]:
$${\mathcal{Q}}_{{N}^{\prime}\times d}=[{q}_{1},{q}_{2},\mathrm{\dots},{q}_{d}]$$ 
$${\mathcal{K}}_{{N}^{\prime}\times d}=[{k}_{1},{k}_{2},\mathrm{\dots},{k}_{d}]$$ 
and we apply the product of transposed $QueryMatrix$ and $KeyMatrix$ to estimate corresponding channelwise $SimilarityMatrix$:
$${\mathcal{S}}_{d\times d}={\mathcal{Q}}^{T}\mathcal{K}$$ 
where ${\mathcal{S}}_{i,j}$ approximates the similarity between the ${i}^{th}$ channel and the ${j}^{th}$ channel of the given feature map ${\mathcal{F}}_{N\times d}$.
Channel Affinity Estimator block. Typical self-attention structures calculate the long-range dependencies in spatial data based on inner products, since the values can, to some extent, represent the similarities between the items. In contrast, we define the non-similarities between channels and term them Channel Affinity. In our approach, the Channel Affinity Matrix of the feature map ${\mathcal{F}}_{N\times d}$ can be modeled as:
$${\mathcal{A}}_{d\times d}=\bm{s}\bm{o}\bm{f}\bm{t}\bm{m}\bm{a}\bm{x}\left(\underset{1\to d}{\bm{e}\bm{x}\bm{p}\bm{a}\bm{n}\bm{d}}\left(\underset{d\to 1}{\mathrm{max}}(\mathcal{S})\right)-{\mathcal{S}}_{d\times d}\right)$$  (15) 
In particular, we select the maximum similarities along the columns of $\mathcal{S}$, and then expand them to the same size as $\mathcal{S}$. By subtracting the original $\mathcal{S}$ from the expanded matrix, the channels with higher similarities have lower affinities (illustrated in Figure 5(b)). Besides, $\bm{s}\bm{o}\bm{f}\bm{t}\bm{m}\bm{a}\bm{x}$ is added to normalize the values, since ${\mathcal{A}}_{d\times d}$ is used as the weight matrix for refinement. In this way, channels can put higher weights on other distinct channels, thereby avoiding the aggregation of similar/redundant information.
According to the weight matrix, we can refine each point’s features by taking the weighted sum of all channels. We apply another $MLP$, ${\mathcal{M}}_{v}(\cdot )$, to get the $ValueMatrix$ as shown below:
$${\mathcal{V}}_{N\times d}=[{v}_{1},{v}_{2},\mathrm{\dots},{v}_{d}]$$ 
$${v}_{i}={\mathcal{M}}_{v}({c}_{i});{v}_{i}\in {\mathbb{R}}^{N}$$ 
This process can be easily achieved by the multiplication between ${\mathcal{V}}_{N\times d}$ and the Channel Affinity Matrix. Additionally, we use a residual connection and learn a weight $\alpha $ to ease block training. The refined feature map by $CAA$ is given below:
$$\mathcal{F}^{\prime}{}_{N\times d}=\bm{C}\bm{A}\bm{A}(\mathcal{F})=\mathcal{F}+\alpha \cdot \mathcal{V}\mathcal{A}$$  (16) 
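A compact NumPy sketch of the $CAA$ computation in Equations 15 and 16 follows. Several details are assumptions rather than statements from the text: each of ${\mathcal{M}}_{q}$, ${\mathcal{M}}_{k}$, and ${\mathcal{M}}_{v}$ is reduced to a single linear map with ReLU (no batch normalization), the maximum and the softmax are both taken along the columns of $\mathcal{S}$ so that every column of $\mathcal{A}$ sums to one, and $\alpha$ is passed in as a plain number. The names `channel_affinity` and `caa` are hypothetical.

```python
import numpy as np

def softmax(a, axis):
    e = np.exp(a - np.max(a, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def channel_affinity(s):
    """Eq. (15): subtract S from its column-wise maximum, then normalize each column."""
    return softmax(np.max(s, axis=0, keepdims=True) - s, axis=0)

def caa(f, w_q, w_k, w_v, alpha):
    """Channel-wise Affinity Attention on a feature map f of shape (N, d)."""
    q = np.maximum(w_q @ f, 0.0)   # CCC: (N', d), each channel vector compressed N -> N'
    k = np.maximum(w_k @ f, 0.0)   # CCC: (N', d)
    v = np.maximum(w_v @ f, 0.0)   # value matrix: (N, d)
    s = q.T @ k                    # channel-wise similarity matrix: (d, d)
    a = channel_affinity(s)        # channel affinity matrix: (d, d)
    return f + alpha * (v @ a)     # Eq. (16): residual connection with learned weight alpha
```

With this axis convention, each output channel is a convex combination of the value channels, and within every column the most similar channel receives the smallest weight, which matches the stated goal of emphasizing distinct channels.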
3.3 Geometric Features
For regular scattered point clouds, the given information about 3D coordinates is minimal. In our work, we attempt to enrich the geometric features of the point cloud from two aspects: (1) we describe the low-level relations explicitly, and (2) extract the high-level information implicitly.
Explicit geometric features. Here we define the explicit geometric features in low-level space to estimate features with explicit purposes. In geometry, a mesh is a well-constructed 3D data format that includes faces, edges, and vertices. Similarly, we incorporate the estimated faces and edges to expand the low-level feature representation of the 3D point clouds. Hence, we propose a Naive Mesh Descriptor ($NMD$) to enrich the original input data (i.e. 3D coordinates) with estimated face features.
Since most mesh data is constructed from triangle faces, we also adapt the point cloud to a triangle mesh format. Specifically, we first search for the two nearest neighbors, i.e. $knn$ with $k=2$, in 3D space for point ${p}_{i}\in {\mathbb{R}}^{3}$, and then form the triangle face corresponding to ${p}_{i}$ with the two neighbors ${p}_{j1},{p}_{j2}\in {\mathbb{R}}^{3}$. To explicitly describe the estimated triangle face, we use six items in total, each with an exact geometric purpose:
$${\stackrel{~}{p}}_{i}=({p}_{i},normal,edg{e}_{1},edg{e}_{2},lengt{h}_{1},lengt{h}_{2})$$  (17) 
Concretely:
$${\stackrel{~}{p}}_{i}\in {\mathbb{R}}^{14}\left\{\begin{array}{ll}{p}_{i}=(x,y,z) & {p}_{i}\in {\mathbb{R}}^{3}\\ normal=edg{e}_{1}\times edg{e}_{2} & normal\in {\mathbb{R}}^{3}\\ edg{e}_{1}={p}_{j1}-{p}_{i} & edg{e}_{1}\in {\mathbb{R}}^{3}\\ edg{e}_{2}={p}_{j2}-{p}_{i} & edg{e}_{2}\in {\mathbb{R}}^{3}\\ lengt{h}_{1}=|edg{e}_{1}| & lengt{h}_{1}\in {\mathbb{R}}^{1}\\ lengt{h}_{2}=|edg{e}_{2}| & lengt{h}_{2}\in {\mathbb{R}}^{1}\end{array}\right.$$ 
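The descriptor of Equation 17 can be computed directly from the coordinates, as in the following NumPy sketch. It is a minimal illustration, not the authors' code: the brute-force neighbor search, the unnormalized cross product for the normal, and the function name `naive_mesh_descriptor` are assumptions.

```python
import numpy as np

def naive_mesh_descriptor(points):
    """Expand (N, 3) coordinates into the (N, 14) descriptor of Eq. (17)."""
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)              # exclude the point itself
    idx = np.argsort(dist2, axis=1)[:, :2]       # two nearest neighbors per point (k = 2)
    edge1 = points[idx[:, 0]] - points           # p_j1 - p_i
    edge2 = points[idx[:, 1]] - points           # p_j2 - p_i
    normal = np.cross(edge1, edge2)              # face normal, left unnormalized (assumption)
    length1 = np.linalg.norm(edge1, axis=1, keepdims=True)
    length2 = np.linalg.norm(edge2, axis=1, keepdims=True)
    return np.concatenate([points, normal, edge1, edge2, length1, length2], axis=1)
```

The column layout is 3 + 3 + 3 + 3 + 1 + 1 = 14, in the same order as the six items listed above.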
Implicit geometric features. In contrast to explicit geometry in low-level space, we also expect to capture more implicit information in high-level space. As explained in Section 3.1, the Attentional Feedback Module for EdgeConv ($AFME$) can extract local prominent features via a max-pooling function in a high-level space. Although the prominent features can encode much geometric information for simple point clouds, more details are needed. Especially for challenging cases, e.g. real objects, complex scenes, or similar shapes, more fine-grained features are required for a comprehensive feature representation.
Specifically, the edge features ${f}_{\psi}({x}_{i})$ from an $AFME$ are employed as the input of the corresponding Fine-grained Edge Feature Extractor ($FEFE$, Figure 6). Instead of max-pooling for prominent features, the $LFC$ layer can aggregate more details from all neighbors. Besides, $CAA$ helps to refine the features for compact outputs. Therefore, the extracted fine-grained features are formulated as:
$${\stackrel{~}{f}}_{i}=\bm{F}\bm{E}\bm{F}\bm{E}({f}_{\psi}({x}_{i}))=\bm{C}\bm{A}\bm{A}(\mathcal{L}({f}_{\psi}({x}_{i})))$$  (18) 
4 Experiments
method  input type  #points  avg class acc.  overall acc.
ECC [31]  $coords$  $1k$  83.2  87.4
PointNet [26]  $coords$  $1k$  86.0  89.2
SCN [46]  $coords$  $1k$  87.6  90.0
Kd-Net [10]  $coords$  $1k$  -  90.6
PointCNN [15]  $coords$  $1k$  88.1  92.2
PCNN [2]  $coords$  $1k$  -  92.3
DensePoint [17]  $coords$  $1k$  -  92.8
RS-CNN [18]  $coords$  $1k$  -  92.9
DGCNN [43]  $coords$  $1k$  90.2  92.9
KPConv [34]  $coords$  $1k$  -  92.9
Ours  $\bm{c}\bm{o}\bm{o}\bm{r}\bm{d}\bm{s}$  $\bm{1}\bm{k}$  91.0  93.8
SO-Net [12]  $coords$  $2k$  87.3  90.9
PointNet++ [27]  $coords+norm$  $5k$  -  91.9
SpecGCN [39]  $coords+norm$  $2k$  -  92.1
SpiderCNN [47]  $coords+norm$  $5k$  -  92.4
DensePoint [17]  $coords+voting$  $1k$  -  93.2
SO-Net [12]  $coords+norm$  $5k$  90.8  93.4
DGCNN [43]  $coords$  $2k$  90.7  93.5
RS-CNN [18]  $coords+voting$  $1k$  -  93.6
In this section, we first provide the implementation and training details, followed by the datasets we use for evaluation. We then analyze our network to establish the effects of different modules. Furthermore, we visualize the outputs, discuss the complexity of our model, and conclude with the performance against state-of-the-art methods on synthetic and real-world point clouds.
Implementation details. Our proposed network starts with a Naive Mesh Descriptor, which expands the input 3D coordinates into a 14-dimensional geometric vector. Next, the geometric features are passed through four modules to learn high-level features in embedding spaces of different dimensions, i.e. 64, 64, 128, and 256. Each module has an Attentional Feedback Module for EdgeConv ($AFME$, with the number of neighbors $k$ set to 20) for extracting prominent features and a Fine-grained Edge Feature Extractor ($FEFE$) for local details.
To incorporate the information from different scales, we concatenate the output feature maps of these modules, and a shared $MLP$ with a $CAA$ module further integrates them into a 1024-dimensional feature map. Then we apply max-pooling and average-pooling in parallel over all channels to obtain a global vector, from which three additional fully connected layers (with 512, 256, and $c$ outputs) regress the confidence scores for all possible categories. Finally, we employ the cross-entropy between predictions and ground-truth labels as our loss function.
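The pooling and classification stage described above can be sketched in NumPy as follows. This is a hedged illustration: concatenating the max-pooled and average-pooled vectors is an assumption (the text only says the two poolings run in parallel), ReLU between the fully connected layers is assumed, any dropout or batch normalization is omitted, and the function names are hypothetical.

```python
import numpy as np

def global_vector(features):
    """Pool an (N, C) feature map over its N points with max and mean, then concatenate to (2C,)."""
    return np.concatenate([features.max(axis=0), features.mean(axis=0)])

def classifier_head(g, weights, biases):
    """Fully connected layers with ReLU between them (assumed) and raw logits at the end."""
    h = g
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h

def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer ground-truth label."""
    shifted = logits - logits.max()                      # stabilize the exponentials
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]
```

Under the concatenation assumption, a 1024-channel feature map yields a 2048-dimensional global vector, so the three layers would have weight shapes `(2048, 512)`, `(512, 256)`, and `(256, c)`.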
model  Naive Mesh Descriptor  length  overall acc.
1  ${\stackrel{~}{p}}_{i}=({p}_{i})$  3  93.4
2  ${\stackrel{~}{p}}_{i}=({p}_{i},n,{l}_{1},{l}_{2})$  8  93.5
3  ${\stackrel{~}{p}}_{i}=({p}_{i},{e}_{1},{e}_{2},{l}_{1},{l}_{2})$  11  93.5
4  ${\stackrel{~}{p}}_{i}=({p}_{i},n,{e}_{1},{e}_{2})$  12  93.7
5  ${\stackrel{~}{p}}_{i}=({p}_{i},n,{e}_{1},{e}_{2},{l}_{1},{l}_{2})$  14  93.8
Training. We apply Stochastic Gradient Descent (SGD) with a momentum of 0.9 as the optimizer for training, and its initial learning rate of 0.1 decreases to 0.001 by cosine annealing [19]. The batch size is set to 32, and the number of training epochs is 300. Besides, we augment the training data with random scaling and translation as in [43], while no pre- or post-processing is performed during testing.
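The learning-rate schedule can be written down in a few lines. The sketch below assumes a single cosine decay without warm restarts, updated once per epoch and reaching the minimum at the final epoch; the function name `cosine_annealing_lr` is hypothetical, while the default values are the ones stated above.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=300, lr_max=0.1, lr_min=0.001):
    """Learning rate at a given epoch under a single cosine decay from lr_max to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

Under these assumptions the rate starts at 0.1, passes through 0.0505 at epoch 150, and ends at 0.001 at epoch 300.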
Datasets. We show the performance of the proposed network on two classification datasets: the classical ModelNet40 [45], which contains synthetic object point clouds, and the recently introduced ScanObjectNN [35], composed of real-world object point clouds.

• ModelNet40. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular because of its various categories, clean shapes, and well-constructed dataset. Specifically, the original ModelNet40 consists of 12,311 CAD-generated meshes in 40 categories, of which 9,843 are used for training while the remaining 2,468 are reserved for testing. The corresponding point cloud data points are uniformly sampled from the mesh surfaces, and then further preprocessed by moving to the origin and scaling into a unit sphere. For our experiments, we only input the $(x,y,z)$ coordinates of 1024 points for each 3D point cloud.
• ScanObjectNN. To further demonstrate the effectiveness and robustness of our classification network, we conduct experiments on ScanObjectNN, a newly published real-world object dataset that has about 15,000 objects in 15 categories. Although it has fewer categories than ModelNet40, it is more challenging in practice than its synthetic counterpart due to the background, missing parts, and various deformations.
4.1 Ablation studies
To verify the functions and effectiveness of different parts in our network, here we conduct two ablation studies about the proposed modules and the contents of Naive Mesh Descriptor, respectively. We investigate the same proposed network on the ModelNet40 dataset.
model  $FME$  $CAA$  $FEFE$  $NMD$  overall acc. 
baseline  92.6  
1  ✓  92.9  
2  ✓  93.0  
3  ✓  ✓  93.2  
4  ✓  ✓  ✓  93.4  
5  ✓  ✓  93.3  
6  ✓  ✓  ✓  93.1  
7  ✓  ✓  ✓  93.4  
8  ✓  ✓  ✓  ✓  93.8 
method  overall acc.  avg class acc.  bag  bin  box  cabinet  chair  desk  display  door  shelf  table  bed  pillow  sink  sofa  toilet
# shapes  -  -  298  794  406  1344  1585  592  678  892  1084  922  564  405  469  1058  325
3DmFV [3]  63  58.1  39.8  62.8  15.0  65.1  84.4  36.0  62.3  85.2  60.6  66.7  51.8  61.9  46.7  72.4  61.2 
PointNet [26]  68.2  63.4  36.1  69.8  10.5  62.6  89.0  50.0  73.0  93.8  72.6  67.8  61.8  67.6  64.2  76.7  55.3 
SpiderCNN [47]  73.7  69.8  43.4  75.9  12.8  74.2  89.0  65.3  74.5  91.4  78.0  65.9  69.1  80.0  65.8  90.5  70.6 
PointNet++ [27]  77.9  75.4  49.4  84.4  31.6  77.4  91.3  74.0  79.4  85.2  72.6  72.6  75.5  81.0  80.8  90.5  85.9 
DGCNN [43]  78.1  73.6  49.4  82.4  33.1  83.9  91.8  63.3  77.0  89.0  79.3  77.4  64.5  77.1  75.0  91.4  69.4 
PointCNN [15]  78.5  75.1  57.8  82.9  33.1  83.6  92.6  65.3  78.4  84.8  84.2  67.4  80.0  80.0  72.5  91.9  71.8 
Ours  80.5  77.8  59.0  84.4  44.4  78.2  92.1  66  91.2  91.0  86.7  70.4  82.7  78.1  72.5  92.4  77.6 
Effects of different modules. Table 3 shows the results of the ablation study concerning different modules of our network. It can be observed that the feedback module works well with $EdgeConv$, and the performance of model 5 shows a further enhancement with the $CAA$ module applied (i.e. $AFME$). Besides, the results of models 3, 4, and 5 indicate that the network benefits from implicit and explicit geometric features. However, it is worth noting that increasing low-level geometric information ($NMD$) alone may not improve performance, because redundant features may cause overfitting (models 5 and 6). Further, the model benefits once we augment the geometric information at both low and high levels.
Naive Mesh Descriptor. Although the idea of adding explicit geometric features is simple and intuitive, comparing models 7 and 8 in Table 3 reveals a 0.4% improvement. To illustrate further, we present another ablation study to investigate the best representation of the Naive Mesh Descriptor. Table 2 shows the results of various possible forms of the $NMD$. According to the experiments, we conclude that the form of the $NMD$ used in model 5 works best, since these terms comprehensively represent the low-level geometric details of the estimated triangle face, including the vertex, face normal, and edges.
Visualization and complexity. Figure 7 visualizes the features learned by the $AFME$ and $FEFE$ modules in different layers of our network on ModelNet40. In particular, all examples show a typical property of CNNs: the shallow layers respond more strongly to simpler features, e.g. edges and corners, while the deep layers connect those simpler features into more semantically specific parts. As stated before, $AFME$ mainly extracts prominent features while $FEFE$ assists in capturing missing details. From the figure we can observe that $FEFE$ complements $AFME$ as expected.
Although we use operations similar to those of the competing methods, e.g. $FC$ layers and the $knn$ algorithm, we reduce the complexity by sharing weights, reducing dimensions, etc. The inference time of our model running on a GeForce RTX 2080 Ti is about 17.5 ms. Compared with other state-of-the-art methods under the same test conditions (please refer to our supplementary material for more experimental results), our approach offers a relatively good compromise between accuracy and model complexity. We expect to further optimize the network for real-time applications.
4.2 Classification Performance
Results on synthetic point clouds. Table 1 shows the quantitative results on the synthetic ModelNet40 classification benchmark. Our network (overall acc: 93.8%, average class acc: 91.0%) exceeds state-of-the-art methods when compared under the same input, i.e. 1k coordinates only. It is worth mentioning that our approach is even better than some methods using extra input points, e.g. DGCNN [43] with 2k inputs obtains an overall acc of 93.5% and an average class acc of 90.7%. Similarly, our algorithm outperforms SO-Net [12], which uses more information (5k inputs with normals) and obtains an overall accuracy of 93.4% and an average class accuracy of 90.8%. We also score higher than RS-CNN [18], which relies on post-processing in the form of a ten-vote evaluation during testing. In terms of the network architecture itself, these results indicate that our approach is promising and effective for classification.
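Since both metrics are reported throughout, a short sketch may help distinguish them: overall accuracy is the fraction of all test samples classified correctly, whereas average class accuracy is the unweighted mean of the per-class accuracies and is therefore more sensitive to small classes. The function name and the toy labels below are illustrative assumptions, not part of the evaluation code behind the tables.

```python
from collections import defaultdict

def overall_and_mean_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, mean per-class accuracy) as fractions in [0, 1]."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)    # number of samples per ground-truth class
    correct = defaultdict(int)  # number of correctly classified samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    overall = sum(correct.values()) / len(y_true)
    mean_class = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, mean_class

# Toy, imbalanced example (made-up labels, not data from the paper).
y_true = ["chair"] * 8 + ["bag"] * 2
y_pred = ["chair"] * 8 + ["bag", "chair"]
oa, macc = overall_and_mean_class_accuracy(y_true, y_pred)
print(f"overall acc: {oa:.3f}, average class acc: {macc:.3f}")
# prints "overall acc: 0.900, average class acc: 0.750"
```

In this toy case a single error in the small class lowers the average class accuracy far more than the overall accuracy, which is why the two numbers can differ by several points on imbalanced benchmarks.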
Results on real-world point clouds. For real-world classification, we use the same network architecture and training strategy, again with 1k 3D coordinates as input. For a fair comparison with state-of-the-art methods, we conduct the classification experiment on the most challenging variant of the ScanObjectNN dataset, PB_T50_RS, as in [35]. Table 4 reports the accuracies of competing methods on this real-world dataset. With an overall accuracy of 80.5% and an average class accuracy of 77.8%, our network significantly improves the classification accuracy on the benchmark. We achieve the best result in 7 out of 15 categories (one of them tied), and for hard cases like bag or display, we increase the accuracy by more than 10%. Furthermore, our approach performs even better than DGCNN [43] and PointNet++ [27] equipped with the background-aware network (BGA) [35], which is designed explicitly for real-object point clouds.
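The per-category count can be checked directly from the rows of Table 4. The sketch below copies the 15 per-category accuracies of each method (the first two columns, overall and average class accuracy, are left out) and counts the columns in which our network is strictly best or tied for best. Columns are indexed by position rather than by category name, and the variable names are illustrative.

```python
# Per-category accuracies (%) copied from the rows of Table 4.
others = {
    "PointNet":   [36.1, 69.8, 10.5, 62.6, 89.0, 50.0, 73.0, 93.8, 72.6, 67.8, 61.8, 67.6, 64.2, 76.7, 55.3],
    "SpiderCNN":  [43.4, 75.9, 12.8, 74.2, 89.0, 65.3, 74.5, 91.4, 78.0, 65.9, 69.1, 80.0, 65.8, 90.5, 70.6],
    "PointNet++": [49.4, 84.4, 31.6, 77.4, 91.3, 74.0, 79.4, 85.2, 72.6, 72.6, 75.5, 81.0, 80.8, 90.5, 85.9],
    "DGCNN":      [49.4, 82.4, 33.1, 83.9, 91.8, 63.3, 77.0, 89.0, 79.3, 77.4, 64.5, 77.1, 75.0, 91.4, 69.4],
    "PointCNN":   [57.8, 82.9, 33.1, 83.6, 92.6, 65.3, 78.4, 84.8, 84.2, 67.4, 80.0, 80.0, 72.5, 91.9, 71.8],
}
ours = [59.0, 84.4, 44.4, 78.2, 92.1, 66.0, 91.2, 91.0, 86.7, 70.4, 82.7, 78.1, 72.5, 92.4, 77.6]

# Best competing accuracy in each category column.
best_other = [max(col) for col in zip(*others.values())]

strictly_best = [k for k, (o, b) in enumerate(zip(ours, best_other)) if o > b]
tied = [k for k, (o, b) in enumerate(zip(ours, best_other)) if o == b]
large_gain = [k for k, (o, b) in enumerate(zip(ours, best_other)) if o - b > 10]

print("strictly best columns:", strictly_best)
print("tied columns:", tied)
print("columns with a gain above 10 points:", large_gain)
```

With these numbers the network is strictly best in six columns and tied in one, giving the seven categories mentioned above, and two columns show a gain of more than 10 points over the best competing method.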
Although the ScanObjectNN dataset contains hard cases for point cloud classification, our method demonstrates its effectiveness and robustness (see the supplementary material for more experimental results). As stated before, point cloud analysis aims to solve practical problems. The excellent performance on a real-object dataset is a strong affirmation of our work.
5 Conclusion
In this paper, we propose a new CNN-based module called the Attentional Feedback Module, targeting some remaining problems of point cloud analysis: the feedback-like modules for edge features automatically help to learn a better point cloud representation, together with the Channel-wise Affinity Attention module that focuses on distinct channels. Besides, we incorporate more explicit geometric information using the Naive Mesh Descriptor and more implicit geometric information using the Fine-grained Edge Feature Extractor. To compare our method with other state-of-the-art networks, we conduct experiments on both synthetic and real-world datasets. The results show the effectiveness and robustness of our approach.
References
 [1] Saeed Anwar and Nick Barnes. Real image denoising with feature attention. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
 [2] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018.
 [3] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. 3dmfv: Three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters, 3(4):3145–3152, 2018.
 [4] François Blais et al. Review of 20 years of range sensor development. Journal of electronic imaging, 13(1):231–243, 2004.
 [5] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
 [6] Can Chen, Luca Zanotti Fragonara, and Antonios Tsourdos. Gapnet: Graph attention based point neural network for exploiting local feature of point cloud. arXiv preprint arXiv:1905.08705, 2019.
 [7] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
 [8] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
 [9] Michel Jaboyedoff, Thierry Oppikofer, Antonio Abellán, Marc-Henri Derron, Alex Loye, Richard Metzger, and Andrea Pedrazzini. Use of lidar in landslide investigations: a review. Natural hazards, 61(1):5–28, 2012.
 [10] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.
 [11] Marcel Körtgen, Gil-Joo Park, Marcin Novotni, and Reinhard Klein. 3d shape matching with 3d shape contexts. In The 7th central European seminar on computer graphics, volume 3, pages 5–17. Budmerice, 2003.
 [12] Jiaxin Li, Ben M Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9397–9406, 2018.
 [13] Ruihui Li, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. Pu-gan: A point cloud upsampling adversarial network. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
 [14] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2018.
 [15] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
 [16] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8778–8785, 2019.
 [17] Yongcheng Liu, Bin Fan, Gaofeng Meng, Jiwen Lu, Shiming Xiang, and Chunhong Pan. Densepoint: Learning densely contextual representation for efficient point cloud processing. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
 [18] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895–8904, 2019.
 [19] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
 [20] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
 [21] Quentin Mérigot, Maks Ovsjanikov, and Leonidas J Guibas. Voronoi-based curvature and feature estimation from point clouds. IEEE Transactions on Visualization and Computer Graphics, 17(6):743–756, 2010.
 [22] Niloy J Mitra, Natasha Gelfand, Helmut Pottmann, and Leonidas Guibas. Registration of point cloud data from a geometric optimization perspective. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pages 22–31. ACM, 2004.
 [23] Niloy J Mitra and An Nguyen. Estimating surface normals in noisy point cloud data. In Proceedings of the nineteenth annual symposium on Computational geometry, pages 322–328. ACM, 2003.
 [24] Markus Oberweger, Paul Wohlhart, and Vincent Lepetit. Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3316–3324, 2015.
 [25] Anshul Paigwar, Ozgur Erkent, Christian Wolf, and Christian Laugier. Attentional pointnet for 3d-object detection in point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
 [26] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
 [27] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
 [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
 [29] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (fpfh) for 3d registration. In 2009 IEEE International Conference on Robotics and Automation, pages 3212–3217. IEEE, 2009.
 [30] Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Efficient ransac for point-cloud shape detection. In Computer graphics forum, volume 26, pages 214–226. Wiley Online Library, 2007.
 [31] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017.
 [32] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
 [33] Yongbin Sun, Yue Wang, Ziwei Liu, Joshua E Siegel, and Sanjay E Sarma. Pointgrow: Autoregressively learned point cloud generation with self-attention. arXiv preprint arXiv:1810.05591, 2018.
 [34] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
 [35] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
 [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 [37] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 [38] George Vosselman, Sander Dijkman, et al. 3d building model reconstruction from point clouds and ground plans. International archives of photogrammetry remote sensing and spatial information sciences, 34(3/W4):37–44, 2001.
 [39] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–66, 2018.
 [40] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10296–10305, 2019.
 [41] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
 [42] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
 [43] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):146, 2019.
 [44] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
 [45] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
 [46] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4606–4615, 2018.
 [47] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. In Proceedings of the European Conference on Computer Vision (ECCV), pages 87–102, 2018.
 [48] Wenxiao Zhang and Chunxia Xiao. Pcan: 3d attention map learning using contextual information for point cloud based retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12436–12445, 2019.
 [49] Kang Zhiheng and Li Ning. Pyramnet: Point cloud pyramid attention network and graph embedding module for classification and segmentation. arXiv preprint arXiv:1906.03299, 2019.