Abstract
Graph Neural Networks (GNNs) have become a topic of intense research recentlydue to their powerful capability in highdimensional classification andregression tasks for graphstructured data. However, as GNNs typically definethe graph convolution by the orthonormal basis for the graph Laplacian, theysuffer from high computational cost when the graph size is large. This paperintroduces the Haar basis, a sparse and localized orthonormal system for graph,constructed from a coarsegrained chain on the graph. The graph convolutionunder Haar basis  the Haar convolution can be defined accordingly for GNNs.The sparsity and locality of the Haar basis allow Fast Haar Transforms (FHTs)on graph, by which a fast evaluation of Haar convolution between the graphsignals and the filters can be achieved. We conduct preliminary experiments onGNNs equipped with Haar convolution, which can obtain stateoftheart resultsfor a variety of geometric deep learning tasks.
Quick Read (beta)
Haar Transforms for Graph Neural Networks
Abstract
Graph Neural Networks (GNNs) have become a topic of intense research recently due to their powerful capability in highdimensional classification and regression tasks for graphstructured data. However, as GNNs typically define the graph convolution by the orthonormal basis for the graph Laplacian, they suffer from high computational cost when the graph size is large. This paper introduces the Haar basis, a sparse and localized orthonormal system for graph, constructed from a coarsegrained chain on the graph. The graph convolution under Haar basis — the Haar convolution can be defined accordingly for GNNs. The sparsity and locality of the Haar basis allow Fast Haar Transforms (FHTs) on graph, by which a fast evaluation of Haar convolution between the graph signals and the filters can be achieved. We conduct preliminary experiments on GNNs equipped with Haar convolution, which can obtain stateoftheart results for a variety of geometric deep learning tasks.
I Introduction
Convolutional neural networks (CNNs) have been a very successful machinery in many highdimensional regression and classification tasks on Euclidean domains [1, 2]. Recently, its generalization to nonEuclidean domains, known as geometric deep learning, has attracted growing attention, due to its great potential in pattern recognition and regression for graphstructured data, see [3].
Graph neural networks (GNNs) are a typical model in geometric deep learning, which replaces the partial derivatives in CNNs by the Laplacian operator [4, 5].
The Laplacian, which carries the structural features of the data, is a secondorder isotropic differential operator that admits a natural generalization to graphs and manifolds. In GNNs, input signals are convoluted with filters under an orthonormal system for the Laplacian. However, as the algebraic properties of regular Euclidean grids are lost in general manifolds and graphs, FFTs for the Laplacian are not available. This leads to the issue that the computation of convolution for graph signal is not always efficient, especially when the graph dataset is large.
In this paper, we introduce an alternative orthonormal system on graph, the Haar basis. It then defines a new graph convolution for GNNs — Haar convolution. Due to the sparsity and locality of the Haar basis, fast Haar transforms (FHTs) can be achieved on graphstructured data. This would significantly improve the computational efficiency of GNNs as the Haar convolution guarantees the linear computational complexity. We apply Haar convolution to GNNs and give a novel type of deep convolutional neural networks on graph — HANet. Numerical tests on real graph datasets show that HANet achieves good performance and computational efficiency in classification and regression tasks. The major contributions of the paper are as follows.

•
The Haar basis is introduced for graph. Both theoretical analysis and real examples of the sparsity and locality are given. With these properties, the fast algorithms for Haar transforms (FHTs) are proposed.

•
The Haar convolution under the Haar basis is given. By FHTs, the computational cost for the Haar convolution is proportional to the size of the data on the graph. Using Haar basis for Haar convolution, the comprehensive geometric information of the graph data can be well preserved.

•
GNN with Haar convolution (HANet) is proposed. We demonstrate experimentally that HANet provides great efficiency and achieves good performance on a broad range of highdimensional regression and classification problems on graphs.
II Related Work
Graph Neural Networks. Developing deep neural networks for graphstructured data has received extensive attention in recent years [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Bruna et al. [4] first propose graph convolution, which is defined by graph Fourier transforms under the orthogonal basis of the graph Laplacian. The graph convolution uses Laplacian eigendecomposition which is computationally expensive. Defferrard et al. [17] approximate smooth filters in the spectral domain by Chebyshev polynomials. Kipf and Welling [18] simplify the convolutional layer by exploiting firstorder Chebyshev polynomial for filters. Following this line, several acceleration methods for graph convolutional networks are proposed [19, 20]. Graph wavelet neural networks [21] replace graph Fourier transform by graph wavelet transform in the graph convolution, where Chebyshev polynomials are used to approximate the graph wavelet basis [22]. Although GWNN circumvents the Laplacian eigendecomposition, the matrix innerproduct operations are nevertheless not avoidable in wavelet transforms for convolution computation. Graph convolutional networks with attention mechanisms [23, 24] can effectively learn the importance between nodes and its neighbors, which is more suitable for node classification task rather than graph regression ones. But much computational and memory cost is requested to perform the attention mechanism in the convolutional layers. Some GNN models [25, 26, 27] use multiscale information and higher order adjacency matrix to define graph convolution. To increase the scalability of the model for largescale graph, Hamilton et al. [28] propose the framework GraphSAGE with sampling and a neural network based aggregator over a fixed size node neighbor. Diffusion convolutional neural networks [29] is developed by using diffusion operator for graph convolution. MoNet [30] introduces a general methodology to define spatialbased graph convolution by the weighted average of multiple weighting functions on neighborhood. [31] provides a unified framework, the Message Passing Neural Networks (MPNNs), by which some existing GNN models are incorporated. Xu et al. [32] present a theoretical analysis for the expressive power of GNNs and propose a simple but powerful variation of GNN, the graph isomorphism network. In [33], the authors propose the path integral based convolution for GNNs (PAN) which generalizes the graph Laplacian to maximal entropy transition matrix derived from a path integral formalism.
Haar Basis and Fast Algorithms. Haar basis rooted in the theory of Haar wavelet basis as first introduced by Haar [34], is a special case of Daubechies wavelets [35], and later developed onto graph by Belkin et al. [36], see also [37]. The fast computation for sparse matrix multiplication is wellstudied (see e.g. [38]). The sparse Fourier transforms by which sparse Haar transforms can be developed are seen in [39, 40, 41].
III Graph convolution with Haar basis
IIIA Graph Fourier transform
Bruna et al. [4] first define the graph convolution based on spectral graph theory [42]. An undirected weighted graph $\mathcal{G}$ is a triplet $(V,E,w)$ with vertices $V$, edges $E$ and weights $w:E\to \mathbb{R}$. Let $$ the realvalued ${l}_{2}$ space on the graph with inner product $f\cdot g:={\sum}_{v\in V}f(v)g(v)$. The (normalized) eigenvectors ${\{{u}_{\mathrm{\ell}}\}}_{\mathrm{\ell}=1}^{V}$ of the graph Laplacian $\mathcal{L}$ forms an orthonormal basis for ${l}_{2}(\mathcal{G})$. Denoted by $N:=V$ the number of vertices of the graph and let $U:=({u}_{1},\mathrm{\dots},{u}_{N})$ be the (Fourier) base matrix whose columns form the Fourier basis for ${l}_{2}(\mathcal{G})$. The graph convolution is given by
$$g\star f=U\left(({U}^{T}g)\odot ({U}^{T}f)\right),$$  (1) 
where ${U}^{T}f$ is the adjoint discrete Fourier transform of $f$, $Uc$ is the forward discrete Fourier transform of $c$ on $\mathcal{G}$, $\odot $ is the elementwise Hadamard product.
While graph convolution defined in (1) is conceptually important, it has some limitations in practice. First, the base matrix $U$ is obtained by using eigendecomposition of the graph Laplacian in the sense that $\mathcal{L}=U\mathrm{\Lambda}{U}^{T}$, where $\mathrm{\Lambda}$ is the diagonal matrix of corresponding eigenvalues. The computational complexity is proportional to $O({N}^{3})$, which is impractical when the number of vertices of the graph is quite large. Second, the computation of the forward and inverse graph Fourier transforms (e.g. ${U}^{T}f$ and $Uc$) have $O({N}^{2})$ computational cost due to the multiplication by (dense) matrices $U$ and ${U}^{T}$. In general, there is no fast algorithms for the graph Fourier transforms. Third, filters in the spectral domain cannot guarantee the localization in the spatial (vertex) domain, and $O(Ndm)$ parameters need to be tuned in the convolutional layer with $m$ filters (hidden nodes) and $d$ features for each vertex.
To alleviate the cost of computing the graph Fourier transform, Chebyshev polynomials [17] are used to construct localized polynomial filters for graph convolution, which is called ChebNet. Kipf and Welling [18] further simplify the ChebNet to obtain graph convolutional networks (GCNs). However, such a polynomialbased approximation strategy may lose some information in the spectral graph convolutional layer, and matrix multiplication is still not avoidable as FFTs is not available for graph convolution. Thus, the graph convolution in this scenario is also computationally expensive, especially for dense graph of large size. We propose an alternative orthonormal basis that allows fast computation for the corresponding graph convolution, which then improves the scalability and efficiency of existing graph models. The basis we use is the Haar basis on a graph. The Haar basis replaces the matrix of eigenvectors $U$ in (1) and forms a highly sparse matrix, which reflects the clustering information of the graph. The sparsity of the Haar transform matrix allows fast computation (in nearly linear computational complexity) of the corresponding graph convolution.
IIIB Haar basis
A basis for ${l}_{2}(\mathcal{G})$ is a set of vectors ${\{{\varphi}_{\mathrm{\ell}}\}}_{\mathrm{\ell}=1}^{N}$ on $\mathcal{G}$ which are linearly independent and orthogonal (i.e. ${\varphi}_{\mathrm{\ell}}\cdot {\varphi}_{{\mathrm{\ell}}^{\prime}}=0$ if $\mathrm{\ell}\ne {\mathrm{\ell}}^{\prime}$). The construction of the Haar basis exploits a chain of the graph. For a graph $\mathcal{G}=(V,E,w)$, a graph ${\mathcal{G}}^{\mathrm{cg}}:=({V}^{\mathrm{cg}},{E}^{\mathrm{cg}},{w}^{\mathrm{cg}})$ is called a coarsegrained graph of $\mathcal{G}$ if ${V}^{\mathrm{cg}}\le V$ and each vertex of $\mathcal{G}$ has a parent in ${\mathcal{G}}^{\mathrm{cg}}$. Each vertex of ${\mathcal{G}}^{\mathrm{cg}}$ is called a cluster of $\mathcal{G}$. Let two integers ${J}_{0},J$, $J>{J}_{0}$. A coarsegrained chain for $\mathcal{G}$ is a set of graphs ${\mathcal{G}}_{{J}_{0}\to J}:=({\mathcal{G}}_{{J}_{0}},{\mathcal{G}}_{{J}_{0}+1},\mathrm{\dots},{\mathcal{G}}_{J})$ if ${\mathcal{G}}_{J}=\mathcal{G}$ and ${\mathcal{G}}_{j}$ is a coarsegrained graph of ${\mathcal{G}}_{j1}$ for $j=J,J1,\mathrm{\dots},{J}_{0}+1$, and ${\mathcal{G}}_{{J}_{0}}$ the top level or the coarsest level and ${\mathcal{G}}_{J}$ the bottom level or the finest level. If the top level ${\mathcal{G}}_{{J}_{0}}$ of the chain has only one node, ${\mathcal{G}}_{{J}_{0}\to J}$ becomes a tree. The chain ${\mathcal{G}}_{{J}_{0}\to J}$ gives a hierarchical partition for the graph $\mathcal{G}$. For details about graphs and chains, refer to examples in [42, 43, 37, 44].
Construction of Haar basis. With a chain of the graph, one can generate a Haar basis for ${l}_{2}(\mathcal{G})$ following [37], see also [45]. We show the construction of Haar basis on $\mathcal{G}$, as follows.
Step 1. Let ${\mathcal{G}}^{\mathrm{cg}}=({V}^{\mathrm{cg}},{E}^{\mathrm{cg}},{w}^{\mathrm{cg}})$ be a coarsegrained graph of $\mathcal{G}$. Each vertex ${v}^{\mathrm{cg}}\in {V}^{\mathrm{cg}}$ is a cluster of $\mathcal{G}$. We define ${N}^{\mathrm{cg}}$ vectors ${\varphi}_{\mathrm{\ell}}^{\mathrm{cg}}$ on ${\mathcal{G}}^{\mathrm{cg}}$ by, for ${u}^{\mathrm{cg}}\in {V}^{\mathrm{cg}}$,
$${\varphi}_{1}^{\mathrm{cg}}({u}^{\mathrm{cg}})=\mathrm{\U0001d7cf}({u}^{\mathrm{cg}})/\sqrt{{N}^{\mathrm{cg}}},$$  (2) 
and for $\mathrm{\ell}=2,\mathrm{\dots},{N}^{\mathrm{cg}}$,
${\varphi}_{\mathrm{\ell}}^{\mathrm{cg}}({u}^{\mathrm{cg}})$  (3)  
$=\sqrt{{\displaystyle \frac{{N}^{\mathrm{cg}}\mathrm{\ell}+1}{{N}^{\mathrm{cg}}\mathrm{\ell}+2}}}\left({\chi}_{\mathrm{\ell}1}^{\mathrm{cg}}({u}^{\mathrm{cg}}){\displaystyle \frac{{\sum}_{j=\mathrm{\ell}}^{{N}^{\mathrm{cg}}}{\chi}_{j}^{\mathrm{cg}}({u}^{\mathrm{cg}})}{{N}^{\mathrm{cg}}\mathrm{\ell}+1}}\right),$ 
where ${N}^{\mathrm{cg}}$ is the number of vertices of ${\mathcal{G}}^{\mathrm{cg}}$, and ${\chi}_{j}^{\mathrm{cg}}$ is the indicator function for the $j$th vertex ${u}_{j}^{\mathrm{cg}}\in {V}^{\mathrm{cg}}$ on $\mathcal{G}$ given by
$${\chi}_{j}^{\mathrm{cg}}({v}^{\mathrm{cg}}):=\{\begin{array}{cc}1,\hfill & {v}^{\mathrm{cg}}={u}_{j}^{\mathrm{cg}},\hfill \\ 0,\hfill & {v}^{\mathrm{cg}}\in {\mathcal{G}}^{\mathrm{cg}}\backslash \{{u}_{j}^{\mathrm{cg}}\}.\hfill \end{array}$$ 
Then, ${\{{\varphi}_{\mathrm{\ell}}^{\mathrm{cg}}\}}_{\mathrm{\ell}=1}^{{N}^{\mathrm{cg}}}$ forms an orthonormal basis for ${l}_{2}({\mathcal{G}}^{\mathrm{cg}})$. For each $\mathrm{\ell}=1,\mathrm{\dots},{N}^{\mathrm{cg}}$, we extend ${\varphi}_{\mathrm{\ell}}^{\mathrm{cg}}$ to a vector on $\mathcal{G}$ by
$${\varphi}_{\mathrm{\ell},1}(v):=\frac{{\varphi}_{\mathrm{\ell}}^{\mathrm{cg}}({v}^{\mathrm{cg}})}{\sqrt{{v}^{\mathrm{cg}}}},v\in {v}^{\mathrm{cg}},\mathrm{\ell}=1,\mathrm{\dots},{N}^{\mathrm{cg}},$$ 
here ${v}^{\mathrm{cg}}$ means the number of vertices in $\mathcal{G}$ whose common parent is ${v}^{\mathrm{cg}}$. For $\mathrm{\ell}=1,\mathrm{\dots},{N}^{\mathrm{cg}}$, ${u}_{\mathrm{\ell}}^{\mathrm{cg}}$ is a cluster of $\mathcal{G}$ and can be viewed as a subset of $V$. We let ${k}_{\mathrm{\ell}}:={u}_{\mathrm{\ell}}^{\mathrm{cg}}$ be the number of vertices in the cluster ${u}_{\mathrm{\ell}}^{\mathrm{cg}}$, for which we label the vertices by
$${u}_{\mathrm{\ell}}^{\mathrm{cg}}=\{{v}_{\mathrm{\ell},1},\mathrm{\dots},{v}_{\mathrm{\ell},{k}_{\mathrm{\ell}}}\}\subseteq V.$$ 
For $k=2,\mathrm{\dots},{k}_{\mathrm{\ell}}$,
$${\varphi}_{\mathrm{\ell},k}=\sqrt{\frac{{k}_{\mathrm{\ell}}k+1}{{k}_{\mathrm{\ell}}k+2}}\left({\chi}_{\mathrm{\ell},k1}\frac{{\sum}_{j=k}^{{k}_{\mathrm{\ell}}}{\chi}_{\mathrm{\ell},j}}{{k}_{\mathrm{\ell}}k+1}\right).$$ 
where ${\chi}_{\mathrm{\ell},j}$ for $j=1,\mathrm{\dots},{k}_{\mathrm{\ell}}$,
$${\chi}_{\mathrm{\ell},j}(v):=\{\begin{array}{cc}1,\hfill & v={v}_{\mathrm{\ell},j},\hfill \\ 0,\hfill & v\in \mathcal{G}\backslash \{{v}_{\mathrm{\ell},j}\}.\hfill \end{array}$$ 
The resulting $\{{\varphi}_{\mathrm{\ell},k}:\mathrm{\ell}=1,\mathrm{\dots},{N}^{\mathrm{cg}},k=1,\mathrm{\dots},{k}_{\mathrm{\ell}}\}$ is an orthonormal basis for ${l}_{2}(\mathcal{G})$.
Step 2. Let ${\mathcal{G}}_{J\to {J}_{0}}$ be a coarsegrained chain for the graph $\mathcal{G}$. An orthonormal basis ${\{{\varphi}_{\mathrm{\ell}}^{(0)}\}}_{\mathrm{\ell}=1}^{{N}_{0}}$ for ${l}_{2}({\mathcal{G}}_{{J}_{0}})$ is generated using (2) and (3). We then repeatedly use Step 1: for $j={J}_{0}+1,\mathrm{\dots},J$, we generate an orthonormal basis ${\{{\varphi}_{\mathrm{\ell}}^{(j)}\}}_{\mathrm{\ell}=1}^{{N}_{j}}$ for ${l}_{2}({\mathcal{G}}_{j})$ from the orthonormal basis ${\{{\varphi}_{\mathrm{\ell}}^{(j1)}\}}_{\mathrm{\ell}=1}^{{N}_{j1}}$ for the coarsegrained graph ${\mathcal{G}}_{j1}$ that was derived in the previous steps. We call the function at the finest level ${\varphi}_{\mathrm{\ell}}:={\varphi}_{\mathrm{\ell}}^{(J)}$, $\mathrm{\ell}=1,\mathrm{\dots},N$ the Haar global orthonormal basis or simply the Haar basis for $\mathcal{G}$ associated with the chain ${\mathcal{G}}_{J\to {J}_{0}}$. The orthonormal basis ${\{{\varphi}_{\mathrm{\ell}}^{(j)}\}}_{\mathrm{\ell}=1}^{{N}_{j}}$ for ${l}_{2}({\mathcal{G}}_{j})$, $j=J1,J2,\mathrm{\dots},{J}_{0}$ is called the associated (orthonormal) basis for the Haar basis ${\{{\varphi}_{\mathrm{\ell}}\}}_{\mathrm{\ell}=1}^{N}$.
Proposition 1.
For each level $j\mathrm{=}{J}_{\mathrm{0}}\mathrm{,}\mathrm{\dots}\mathrm{,}J$, the ${\mathrm{\{}{\varphi}_{\mathrm{\ell}}^{\mathrm{(}j\mathrm{)}}\mathrm{\}}}_{\mathrm{\ell}\mathrm{=}\mathrm{1}}^{{N}_{j}}$ is an orthonormal basis for ${l}_{\mathrm{2}}\mathit{}\mathrm{(}{\mathrm{G}}_{j}\mathrm{)}$, and in particular, ${\mathrm{\{}{\varphi}_{\mathrm{\ell}}\mathrm{\}}}_{\mathrm{\ell}\mathrm{=}\mathrm{1}}^{N}$ is an orthonormal basis for ${l}_{\mathrm{2}}\mathit{}\mathrm{(}\mathrm{G}\mathrm{)}$; each basis ${\mathrm{\{}{\varphi}_{\mathrm{\ell}}^{\mathrm{(}j\mathrm{)}}\mathrm{\}}}_{\mathrm{\ell}\mathrm{=}\mathrm{1}}^{{N}_{j}}$ is the Haar basis for the chain ${\mathrm{G}}_{j\mathrm{\to}{J}_{\mathrm{0}}}$.
Proposition 2.
Let ${\mathrm{G}}_{J\mathrm{\to}{J}_{\mathrm{0}}}$ be a chain. If each parent of level ${\mathrm{G}}_{j}$, $j\mathrm{=}J\mathrm{}\mathrm{1}\mathrm{,}J\mathrm{}\mathrm{2}\mathrm{,}\mathrm{\dots}\mathrm{,}{J}_{\mathrm{0}}$, contains at least two children, the number of different values of the Haar basis ${\varphi}_{\mathrm{\ell}}$, $\mathrm{\ell}\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}N$, is bounded by a constant.
The Haar basis depends on the chain for the graph. If the topology of the graph is well reflected by the clustering of the chain, the Haar basis then contains the crucial geometric information of the graph. For example, by using $k$means clustering algorithm [46] or METIS algorithm [47] one can generate a chain that reveals desired geometric properties of the graph.
Figure 1b shows a chain ${\mathcal{G}}_{2\to 0}$ with $3$ levels of a graph $\mathcal{G}$. Here, for each level, the vertices are given by
${V}^{(2)}$  $=V=\{{v}_{1},\mathrm{\dots},{v}_{8}\},$  
${V}^{(1)}$  $=\{{v}_{1}^{(1)},{v}_{2}^{(1)},{v}_{3}^{(1)},{v}_{4}^{(1)}\}$  
$=\{\{{v}_{1},{v}_{2}\},\{{v}_{3},{v}_{4}\},\{{v}_{5},{v}_{6}\},\{{v}_{7},{v}_{8}\}\},$  
${V}^{(0)}$  $=\{{v}_{1}^{(0)},{v}_{2}^{(0)}\}=\{\{{v}_{1}^{(1)},{v}_{2}^{(1)}\},\{{v}_{3}^{(1)},{v}_{4}^{(1)}\}\}.$ 
Figure 1a shows the Haar basis for the chain ${\mathcal{G}}_{2\to 0}$. There are in total $8$ vectors of the Haar basis for $\mathcal{G}$. From construction, the Haar basis ${\varphi}_{\mathrm{\ell}}$ and the associated basis ${\varphi}_{\mathrm{\ell}}^{(j)}$, $j=1,2$ are closely connected: the ${\varphi}_{1},{\varphi}_{2}$ can be reduced to ${\varphi}_{1}^{(2)},{\varphi}_{2}^{(2)}$ and the ${\varphi}_{1},{\varphi}_{2},{\varphi}_{3},{\varphi}_{4}$ can be reduced to ${\varphi}_{1}^{(1)},{\varphi}_{2}^{(1)},{\varphi}_{3}^{(1)},{\varphi}_{4}^{(1)}$. This connection would allow fast algorithms for Haar transforms as given in Algorithms 1 and 2. In Figure 1, the matrix ${\mathrm{\Phi}}^{T}$ of the $8$ Haar basis vectors ${\varphi}_{\mathrm{\ell}}$ on $\mathcal{G}$ has good sparsity. With the increase of the graph size, the sparsity of the Haar basis matrix $\mathrm{\Phi}$ becomes more prominent. We will later demonstrate this finding in our experimental study.
IIIC Haar convolution
With the Haar basis constructed in Section IIIB, we can define Haar convolution as an alternative form of spectral graph convolution in (1). Let ${\{{\varphi}_{\mathrm{\ell}}\}}_{\mathrm{\ell}=1}^{N}$ be the Haar basis associated with a chain ${\mathcal{G}}_{J\to {J}_{0}}$ of a graph $\mathcal{G}$. Denoted by $\mathrm{\Phi}=({\varphi}_{1},\mathrm{\dots},{\varphi}_{N})\in {R}^{N\times N}$ the Haar transform matrix. We define by
$${\mathrm{\Phi}}^{T}f=(\sum _{v\in V}{\varphi}_{1}(v)f(v),\mathrm{\dots},\sum _{v\in V}{\varphi}_{N}(v)f(v))$$  (4) 
the adjoint Haar transform for the signal $f$ on $\mathcal{G}$, and by
$$(\mathrm{\Phi}c)(v)=\sum _{\mathrm{\ell}=1}^{N}{\varphi}_{\mathrm{\ell}}(v){c}_{\mathrm{\ell}},v\in V,$$  (5) 
the forward Haar transform for (coefficients) vector $c:=({c}_{1},\mathrm{\dots},{c}_{N})\in {\mathbb{R}}^{N}$. We call the matrix $\mathrm{\Phi}$ Haar transform matrix.
Definition 3.
The Haar convolution for a filter $g$ and a signal $f$ on $\mathrm{G}$ can be defined as
$$g\star f=\mathrm{\Phi}(({\mathrm{\Phi}}^{T}g)\odot ({\mathrm{\Phi}}^{T}f)).$$  (6) 
Computationally, (6) is obtained by performing forward Haar transform of the elementwise Hadamard product between adjoint Haar transform of $g$ and $f$. Compared with the Laplacian based spectral graph convolution given in (1), the Haar convolution has several characteristics: (i) the Haar transform matrix $\mathrm{\Phi}$ is sparse so that the computation of ${\mathrm{\Phi}}^{T}f$ or $\mathrm{\Phi}c$ is more efficient than ${U}^{T}f$ or $Uc$; (ii) as the Haar basis is constructed based on the chain of the graph which reflects the clustering property for vertices, the Haar convolution can potentially extract abstract features (localized components) over $f$, which can be viewed as the learning representation for graph signal. From this perspective, Haar convolution might be more suitable for graphlevel based modelling tasks (e.g. graph classification/regression) than nodelevel ones (e.g. semisupervised node classification); (iii) by means of the sparsity of the Haar basis, adjoint and forward Haar transforms can be implemented by the fast algorithms that have nearly linear computational complexity (with respect to the size of $f$).
We can change the execution of ADHTs on the filter $g$ (e.g. ${\mathrm{\Phi}}^{T}g$) and regard the filter as defined in the frequency domain, and can thus refine Haar convolution as $g\star f=\mathrm{\Phi}(g\odot ({\mathrm{\Phi}}^{T}f))$.
Fast algorithms for Haar tranforms. The computation of Haar transforms can also be accelerated by using sparse matrix multiplications due to the sparsity of the Haar transform matrix. This would allow the linear computational complexity $O(\u03f5N)$ with sparsity $1\u03f5$ of the Haar transform matrix. Moreover, a similar computational strategy to the sparse Fourier transforms [41, 48] can be used to have the Haar transforms achieve an even faster algorithm with complexity $O(k\mathrm{log}N)$ for graph with $N$ nodes and Haar transform matrix with $k$ nonzero elements. By the sparsity of Haar transform matrix, fast Haar transforms (FHTs) which includes adjoint Haar transform and forward Haar transform can be developed to speed up the implementation of Haar convolution. Theorems 4 and 5 in the following section show that the computational cost of adjoint and forward Haar transform can reach $\mathcal{O}(N)$ and $\mathcal{O}(N{(\mathrm{log}N)}^{2})$. They are nearly linear computational complexity and are then called fast Haar transforms (FHTs). Meanwhile, the Haar convolution can be evaluated by FHTs in $\mathcal{O}(N{(\mathrm{log}N)}^{2})$ steps.
Weight sharing. We can use weight sharing in Haar convolution to reduce the number of parameters of the filter, and capture the common feature of the nodes which are in the same cluster. As the clustering contains the information of the neighbourhood, we can use the chain ${\mathcal{G}}_{J\to {J}_{0}}$ for weight sharing: the vertices of the graph which have the same parent at a coarser level share a parameter of the filter. Here, the coarser level is some fixed level ${J}_{1}$, $$. For example, the weight sharing rule for chain ${\mathcal{G}}_{2\to 0}$ in Figure 1b is: assign the weight ${g}_{i}$ for each node ${v}_{i}^{(0)}$, $i=1,2$ on the top level, the filter (or the weight vector) at the bottom level is then $g=({g}_{1},{g}_{1},{g}_{1},{g}_{1},{g}_{2},{g}_{2},{g}_{2},{g}_{2})$. In this way, we can use the filter $g$ with two independent parameters ${g}_{1},{g}_{2}$ to convolute with the input vector with $8$ components.
IV Fast algorithms under Haar basis
For the Haar convolution introduced in Definition 6, we can develop an efficient computational strategy by virtue of the sparsity of the Haar basis, as mentioned. Let ${\mathcal{G}}_{J\to {J}_{0}}$ be a chain of the graph $\mathcal{G}$. For convenience, we label the vertices of the level$j$ graph ${\mathcal{G}}_{j}$ by ${V}_{j}:=\{{v}_{1}^{(j)},\mathrm{\dots},{v}_{{N}_{j}}^{(j)}\}$.
Fast computation for adjoint Haar transform ${\mathrm{\Phi}}^{T}\mathbf{}f$. The adjoint Haar transform in (4) can be computed in the following way. For $j={J}_{0},\mathrm{\dots},J1$, let ${c}_{k}^{(j)}$ be the number of children of ${v}_{k}^{(j)}$, i.e. the number of vertices of ${\mathcal{G}}_{j+1}$ which belongs to the cluster ${v}_{k}^{(j)}$, for $k=1,\mathrm{\dots},{N}_{j}$. For $j=J$, let ${c}_{k}^{(J)}\equiv 1$ for $k=1,\mathrm{\dots},N$. For $j={J}_{0},\mathrm{\dots},J$ and $k=1,\mathrm{\dots},{N}_{j}$, we define the weight factor for ${v}_{k}^{(j)}$ by
$${w}_{k}^{(j)}:=1/\sqrt{{c}_{k}^{(j)}}.$$  (7) 
Let ${W}_{J\to {J}_{0}}:=\{{w}_{k}^{(j)}:j={J}_{0},\mathrm{\dots},J,k=1,\mathrm{\dots},{N}_{j}\}$. Then, the weighted chain $({\mathcal{G}}_{J\to {J}_{0}},{W}_{J\to {J}_{0}})$ is a filtration if each parent in the chain ${\mathcal{G}}_{J\to {J}_{0}}$ has at least two children. See e.g. [37, Definition 2.3].
Let ${\{{\varphi}_{\mathrm{\ell}}\}}_{\mathrm{\ell}=1}^{N}$ be the Haar basis for the filtration $({\mathcal{G}}_{J\to {J}_{0}},{W}_{J\to {J}_{0}})$ of a graph $\mathcal{G}$. We define the weighted sum for $f\in {l}_{2}(\mathcal{G})$ by
$${\mathcal{S}}^{(J)}(f,{v}_{k}^{(J)}):=f({v}_{k}^{(J)}),{v}_{k}^{(J)}\in {\mathcal{G}}_{J},$$  (8) 
and for $j={J}_{0},\mathrm{\dots},J1$ and ${v}_{k}^{(j)}\in {\mathcal{G}}_{j}$,
$${\mathcal{S}}^{(j)}(f,{v}_{k}^{(j)}):=\sum _{{v}_{{k}^{\prime}}^{(j+1)}\in {v}_{k}^{(j)}}{w}_{{k}^{\prime}}^{(j+1)}{\mathcal{S}}^{(j+1)}(f,{v}_{{k}^{\prime}}^{(j+1)}).$$  (9) 
For each vertex ${v}_{k}^{(j)}$ of ${\mathcal{G}}_{j}$, the ${\mathcal{S}}^{(j)}(f,{v}_{k}^{(j)})$ is the weighted sum of the ${\mathcal{S}}^{(j+1)}(f,{v}_{{k}^{\prime}}^{(j+1)})$ at the level $j+1$ for those vertices ${v}_{{k}^{\prime}}^{(j+1)}$ of ${\mathcal{G}}_{j+1}$ whose parent is ${v}_{k}^{(j)}$.
The adjoint Haar transform can be evaluated by the following theorem.
Theorem 4.
Let ${\mathrm{\{}{\varphi}_{\mathrm{\ell}}\mathrm{\}}}_{\mathrm{\ell}\mathrm{=}\mathrm{1}}^{N}$ be the Haar basis for the filtration $\mathrm{(}{\mathrm{G}}_{J\mathrm{\to}{J}_{\mathrm{0}}}\mathrm{,}{W}_{J\mathrm{\to}{J}_{\mathrm{0}}}\mathrm{)}$ of a graph $\mathrm{G}$. Then, the adjoint Haar transform for the vector $f$ on the graph $\mathrm{G}$ can be computed by, for $\mathrm{\ell}\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}N$,
$${({\mathrm{\Phi}}^{T}f)}_{\mathrm{\ell}}=\sum _{k=1}^{{N}_{j}}{\mathcal{S}}^{(j)}(f,{v}_{k}^{(j)}){w}_{k}^{(j)}{\varphi}_{\mathrm{\ell}}^{(j)}({v}_{k}^{(j)}),$$  (10) 
where ${\varphi}_{\mathrm{\ell}}^{\mathrm{(}j\mathrm{)}}$ is the $\mathrm{\ell}$th member of the orthonormal basis for ${l}_{\mathrm{2}}\mathit{}\mathrm{(}{\mathrm{G}}_{j}\mathrm{)}$ associated with the Haar basis ${\varphi}_{\mathrm{\ell}}$ (see Section IIIB), ${v}_{k}^{\mathrm{(}j\mathrm{)}}$ are the vertices of ${\mathrm{G}}_{j}$ and the weight factors ${w}_{k}^{\mathrm{(}j\mathrm{)}}$ are given by (7).
Proof.
By the relation between ${\varphi}_{\mathrm{\ell}}$ and ${\varphi}_{\mathrm{\ell}}^{(j)}$,
${({\mathrm{\Phi}}^{T}f)}_{\mathrm{\ell}}$  $={\displaystyle \sum _{k=1}^{N}}f({v}_{k}^{(J)}){\varphi}_{\mathrm{\ell}}({v}_{k}^{(J)})$  
$={\displaystyle \sum _{{k}^{\prime}=1}^{{N}_{J1}}}\left({\displaystyle \sum _{{v}_{k}^{(J)}\in {v}_{{k}^{\prime}}^{(J1)}}}f({v}_{k}^{(J)})\right){w}_{{k}^{\prime}}^{(J1)}{\varphi}_{\mathrm{\ell}}^{(J1)}({v}_{{k}^{\prime}}^{(J1)})$  
$={\displaystyle \sum _{{k}^{\prime}=1}^{{N}_{J1}}}{\mathcal{S}}^{(J1)}(f,{v}_{{k}^{\prime}}^{(J1)}){w}_{{k}^{\prime}}^{(J1)}{\varphi}_{\mathrm{\ell}}^{(J1)}({v}_{{k}^{\prime}}^{(J1)})$  
$={\displaystyle \sum _{{k}^{\prime \prime}=1}^{{N}_{J2}}}\left({\displaystyle \sum _{{v}_{{k}^{\prime}}^{(J1)}\in {v}_{{k}^{\prime \prime}}^{(J2)}}}{\mathcal{S}}^{(J1)}(f,{v}_{{k}^{\prime}}^{(J1)}){w}_{{k}^{\prime}}^{(J1)}\right)$  
$\mathrm{\hspace{1em}}\times {w}_{{k}^{\prime \prime}}^{(J2)}{\varphi}_{\mathrm{\ell}}^{(J2)}({v}_{{k}^{\prime \prime}}^{(J2)})$  
$={\displaystyle \sum _{{k}^{\prime \prime}=1}^{{N}_{J2}}}{\mathcal{S}}^{(J2)}(f,{v}_{{k}^{\prime \prime}}^{(J2)}){w}_{{k}^{\prime \prime}}^{(J2)}{\varphi}_{\mathrm{\ell}}^{(J2)}({v}_{{k}^{\prime \prime}}^{(J2)})$  
$\mathrm{\cdots}$  
$={\displaystyle \sum _{k=1}^{{N}_{j}}}{\mathcal{S}}^{(j)}(f,{v}_{k}^{(j)}){w}_{k}^{(j)}{\varphi}_{\mathrm{\ell}}^{(j)}({v}_{k}^{(j)}),$ 
thus completing the proof. ∎
Fast computation for forward Haar transform $\mathrm{\Phi}\mathbf{}c$. The forward Haar transform in (5) can be computed, as follows.
Theorem 5.
Let ${\mathrm{\{}{\varphi}_{\mathrm{\ell}}\mathrm{\}}}_{\mathrm{\ell}\mathrm{=}\mathrm{1}}^{N}$ be the Haar basis for a filtration (${\mathrm{G}}_{J\mathrm{\to}{J}_{\mathrm{0}}}$,${W}_{J\mathrm{\to}{J}_{\mathrm{0}}}$) of the graph $\mathrm{G}$ and ${\mathrm{\{}{\varphi}_{\mathrm{\ell}}^{\mathrm{(}j\mathrm{)}}\mathrm{\}}}_{\mathrm{\ell}\mathrm{=}\mathrm{1}}^{{N}_{j}}$, $j\mathrm{=}{J}_{\mathrm{0}}\mathrm{,}\mathrm{\dots}\mathrm{,}J$ is the associated bases at ${\mathrm{G}}_{j}$. Then, the forward Haar transform for vector $c\mathrm{=}\mathrm{(}{c}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{c}_{N}\mathrm{)}$ in ${\mathrm{R}}^{N}$ can be computed by, for $k\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}N$,
$${(\mathrm{\Phi}c)}_{k}=\sum _{j=1}^{J}{W}_{k}^{(j)}\left(\sum _{\mathrm{\ell}={N}_{j1}+1}^{{N}_{j}}{c}_{\mathrm{\ell}}{\varphi}_{\mathrm{\ell}}^{(j)}({v}_{{k}_{j}}^{(j)})\right),$$ 
where for $k\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}N$, ${v}_{{k}_{j}}^{\mathrm{(}j\mathrm{)}}$ is the parent of ${v}_{k}^{\mathrm{(}J\mathrm{)}}$ at level $j$, and ${W}_{k}^{\mathrm{(}J\mathrm{)}}\mathrm{:=}\mathrm{1}$ and
$${W}_{k}^{(j)}:=\prod _{n=2}^{j}{w}_{{k}_{n}}^{(n)}\mathit{\text{for}}j={J}_{0},\mathrm{\dots},J1,$$  (11) 
where the weight factors ${w}_{{k}_{n}}^{\mathrm{(}n\mathrm{)}}$ for $n\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{\dots}\mathrm{,}J$ are given by (7).
Proof.
Let ${N}_{j}:={V}_{j}$ for $j={J}_{0},\mathrm{\dots},J$ and ${N}_{{J}_{0}1}:=0$. For $k=1,\mathrm{\dots},{N}_{J}$, let ${v}_{k}^{(J)}$ the $k$th vertex of ${\mathcal{G}}_{J}$. For $i={J}_{0},\mathrm{\dots},J1$, there exists ${k}_{i}=1,\mathrm{\dots},{N}_{j}$ such that ${v}_{{k}_{i}}^{(i)}$ the parent at level $i$ of ${v}_{k}^{(J)}$. By the property of the Haar basis, for each eigenvector ${\varphi}_{\mathrm{\ell}}$ there exists $j={J}_{0},\mathrm{\dots},J$ such that $\mathrm{\ell}={N}_{j1}+1,\mathrm{\dots},{N}_{j}$, ${\varphi}_{\mathrm{\ell}}$ is a constant for the vertices of ${\mathcal{G}}_{J}=\mathcal{G}$ who have the same parent at level $j$. Then,
${\varphi}_{\mathrm{\ell}}({v}_{k}^{(J)})$  $={w}_{{k}_{J1}}^{(J1)}{\varphi}_{\mathrm{\ell}}^{(J1)}({v}_{{k}_{J1}}^{(J1)})$  
$={w}_{{k}_{J1}}^{(J1)}{w}_{{k}_{J2}}^{(J2)}{\varphi}_{\mathrm{\ell}}^{(J2)}({v}_{{k}_{J2}}^{(J2)})$  
$=\left({\displaystyle \prod _{n={J}_{0}}^{j}}{w}_{{k}_{n}}^{(n)}\right){\varphi}_{\mathrm{\ell}}^{(j)}({v}_{{k}_{j}}^{(j)})$  
$={W}_{k}^{(j)}{\varphi}_{\mathrm{\ell}}^{(j)}({v}_{{k}_{j}}^{(j)}).$  (12) 
where the product of the weights in the third equality only depends upon the level $j$ and the vertex ${v}_{k}^{(1)}$, and we have let
$${W}_{k}^{(j)}:=\prod _{n=1}^{j}{w}_{{k}_{n}}^{(n)}.$$ 
By (IV),
$\mathrm{\Phi}(c,{v}_{k}^{(J)})$  $={\displaystyle \sum _{\mathrm{\ell}=1}^{N}}{c}_{\mathrm{\ell}}{\varphi}_{\mathrm{\ell}}({v}_{k}^{(J)})$  
$={\displaystyle \sum _{j={J}_{0}}^{J}}{\displaystyle \sum _{\mathrm{\ell}={N}_{j1}+1}^{{N}_{j}}}{c}_{\mathrm{\ell}}{\varphi}_{\mathrm{\ell}}({v}_{k}^{(J)})$  
$={\displaystyle \sum _{j={J}_{0}}^{J}}{\displaystyle \sum _{\mathrm{\ell}={N}_{j1}+1}^{{N}_{j}}}{c}_{\mathrm{\ell}}{W}_{k}^{(j)}{\varphi}_{\mathrm{\ell}}^{(j)}({v}_{{k}_{j}}^{(j)})$  
$={\displaystyle \sum _{j={J}_{0}}^{J}}{W}_{k}^{(j)}\left({\displaystyle \sum _{\mathrm{\ell}={N}_{j1}+1}^{{N}_{j}}}{c}_{\mathrm{\ell}}{\varphi}_{\mathrm{\ell}}^{(j)}({v}_{{k}_{j}}^{(j)})\right),$ 
thus completing the proof. ∎
 1.

2.
For each $\mathrm{\ell}$, let $j$ be the integer such that ${N}_{j1}+1\le \mathrm{\ell}\le {N}_{j}$, where ${N}_{{J}_{0}1}:=0$. Evaluating
Computational Complexity Analysis. Algorithm 1 gives the computational steps for the evaluation of ${({\mathrm{\Phi}}^{T}f)}_{\mathrm{\ell}}$, $\mathrm{\ell}=1,\mathrm{\dots},N$ in Theorem 4. In the first step of Algorithm 1, the total number of summations to compute all elements of (13) is no more than ${\sum}_{i=0}^{j1}{N}_{i+1}$; In the second step, the total number of multiplication and summation operations is at most $2{\sum}_{\mathrm{\ell}=1}^{N}C=\mathcal{O}(N)$. Here $C$ is the constant which bounds the number of distinct values of the Haar basis (see Proposition 2). Thus, the computational steps of Algorithm 1 are $\mathcal{O}(N)$.
By Theorem 5, the evaluation of the forward Haar transform $\mathrm{\Phi}c$ can be implemented by Algorithm 2. In the first step of Algorithm 2, the number of multiplications is no more than ${\sum}_{\mathrm{\ell}=1}^{N}C=\mathcal{O}(N)$; in the second step, the number of summations is no more than ${\sum}_{\mathrm{\ell}=1}^{N}C=\mathcal{O}(N)$ as well; in the third step, the computational steps are $\mathcal{O}(N{(\mathrm{log}N)}^{2})$; in the last step, the total number of summations and multiplications is $\mathcal{O}(N\mathrm{log}N)$. Thus, the total computational steps are $\mathcal{O}(N{(\mathrm{log}N)}^{2})$.
Hence, Algorithms 1 and 2 have linear computational cost (up to a $\mathrm{log}N$ term). We call these two algorithms fast Haar transforms (FHTs) under Haar basis on the graph.
Proposition 6.
Proposition 6 shows that DFTs can recover the graph signal $f$ from the forward Haar transform ${\mathrm{\Phi}}^{T}f$. This means that forward and adjoint Haar transforms have zeroloss in graph signal transmission.
The Haar convolution can be evaluated fast by FHTs in Algorithms 1 and 2. We show the computational steps for Haar convolution in Algorithm 3. From the above discussion, the computational complexity of Algorithm 3 is $\mathcal{O}(N{(\mathrm{log}N)}^{2})$. We call Algorithm 3 the fast Haar convolution (algorithm). This can be used in graph neural networks which we introduce now.
V Graph neural networks with Haar transforms
VA Models
The Haar convolution of (1) can be applied to all architectures of graph neural networks. For graph classification and regression tasks, we can apply the model with convolutional layer consisting of $m$hidden neutrons and a nonlinear activation function $\sigma $ (e.g. ReLU): for $i=1,2\mathrm{\dots},m$,
${f}_{i}^{\mathrm{out}}$  $=\sigma \left({\displaystyle \sum _{j=1}^{d}}\mathrm{\Phi}\left({g}_{i,j}\odot ({\mathrm{\Phi}}^{T}{f}_{j}^{\mathrm{in}})\right)\right)$  
$=\sigma \left({\displaystyle \sum _{j=1}^{d}}\mathrm{\Phi}{G}_{i,j}{\mathrm{\Phi}}^{T}{f}_{j}^{\mathrm{in}}\right),$  (14) 
for input graph data ${F}^{\mathrm{in}}=({f}_{1}^{\mathrm{in}},{f}_{2}^{\mathrm{in}},\mathrm{\dots},{f}_{d}^{\mathrm{in}})\in {R}^{N\times d}$ with $N$ nodes and $d$ input features (for each vertex). Here, the feature ${f}_{j}^{\mathrm{in}}$ of the input signal is convolved with the learnable filter ${g}_{i,j}\in {\mathbb{R}}^{N}$ by Haar transforms, and then all Haartransformed features are combined as a new feature ${f}_{i}^{\mathrm{out}}$. This gives the output matrix ${F}^{\mathrm{out}}=({f}_{1}^{\mathrm{out}},{f}_{2}^{\mathrm{out}},\mathrm{\dots},{f}_{m}^{\mathrm{out}})\in {\mathbb{R}}^{N\times m}$. If we write ${G}_{i,j}\in {\mathbb{R}}^{N\times N}$ as the diagonal matrix of filter ${g}_{i,j}$, the convolutional layer has the compact form of the second equality in (VA). We call the GNNs with Haar convolution in (VA) HANet.
Weight detaching. For each layer, $O(Ndm)$ parameters need to be tuned. To reduce the number of parameters, we can replace the filter matrix ${G}_{i,j}$ by a unified diagonal filter matrix $G$ and a compression matrix $W\in {R}^{d\times m}$ (which is a detaching approach used in conventional CNN for extracting features). This then leads to a concise form
$$\widehat{F}=\sigma \left(\mathrm{\Phi}\left(G({\mathrm{\Phi}}^{T}F)\right)W\right).$$  (15) 
Then, it requires $O(N+dm)$ parameters to train. Recall that constructing the Haar basis uses a chain ${\mathcal{G}}_{J\to {J}_{0}}$ for the graph $\mathcal{G}$, one can implement weight sharing based on the same chain structure. Specifically, $k$means clustering algorithm [46] or METIS algorithm [47] can be used to generate a chain that reveals desired geometric properties of the graph. Suppose we consider a coarser level ${J}_{1}$ ($$) having $K$ clusters, then all vertices in the same cluster share the common filter parameter. Similarly, the corresponding children vertices in level ${J}_{1}1$ share the same filter parameters as used in their parent vertices, until the bottom level corresponds to the whole set of vertices of the input graph. Thus, the parameter complexity is reduced to $\mathcal{O}(K+dm)$.
The HANet can be evaluated by performing $d$times fast Haar convolutions (consisting of $d$times adjoint and forward Haar transforms). The total computational cost is $\mathcal{O}(N{(\mathrm{log}N)}^{2}d)$. Deep GNNs with Haar convolution can be built by stacking multiple Haar convolutional layers of (15), followed by an output layer.
HANet for graph classification and regression. Graph classification and regression can be formulated as supervised learning. Given a collection of graphstructured data ${\{{f}_{i}\}}_{i=1}^{n}$ with labels ${\{{y}_{i}\}}_{i=1}^{n}$, the objective of the classification task is to find a mapping that can classify or predict labels. The model of HANet uses a similar architecture of deep convolutional neural network which has several Haar convolutional layers and fully connected dense layers. Figure 2a shows the flowchat of HANet with multiple Haar convolutional layers: the chain ${\mathcal{G}}_{J\to {J}_{0}}$ and the Haar basis ${\varphi}_{\mathrm{\ell}}$ and the associated basis ${\varphi}_{\mathrm{\ell}}^{(j)}$, $j={J}_{0},\mathrm{\dots},J$ are precomputed; graphstructured input $f$ is Haarconvoluted with filter $g$ which is of length $N$ but with ${N}_{J1}$ independent parameters, where $g$ is expanded from level $J1$ to $J$ by weight sharing, and the output ${f}^{\mathrm{out}}$ of the first layer is the ReLU of the Haar convolution of $g$ and $f$; the graph pooling reduces ${f}^{\mathrm{out}}$ of size ${N}_{J}$ to ${\stackrel{~}{f}}^{\mathrm{in}}$ of size ${N}_{J1}$; and in the second Haar convolutional layer, the input is ${\stackrel{~}{f}}^{\mathrm{in}}$ and the Haar basis is ${\varphi}_{\mathrm{\ell}}^{(J1)}$; the following layers continue this process; the final Haar convolutional layer is fully connected by one or multiple dense layers. For classification, the softmax function is applied to the last dense layer.
HANet for node classification. In node classification, the whole graph is the only single input, where a small proportion of nodes are labeled. The output is the graph with all unknown labels predicted. We can use the network with two layers as
$$\text{HANet}({f}^{\mathrm{in}}):=\mathrm{softmax}\left({\mathrm{HC}}^{(2)}\left(\mathrm{ReLU}\left({\mathrm{HC}}^{(1)}\left({f}^{\mathrm{in}}\right)\right)\right)\right)$$  (16) 
where ${\mathrm{HC}}^{(1)}$ and ${\mathrm{HC}}^{(2)}$ are the Haar convolutional layers
$${\mathrm{HC}}^{(i)}(f):=\widehat{A}({w}_{1}^{(i)}\star f){w}_{2}^{(i)},i=1,2,$$ 
where we use the modified Haar convolution ${w}_{1}^{(i)}\star f=\mathrm{\Phi}\left({w}_{1}^{(i)}\odot ({\mathrm{\Phi}}^{T}f)\right)$. For a graph with $N$ nodes and $M$ features, in the first Haar convolutional layer, the filter ${w}_{1}^{(1)}$ contains ${N}_{0}\times M$ parameters and is extended to a matrix $N\times M$ by weight sharing, where ${N}_{0}$ is the number of nodes at the coarsest level. The ${w}_{2}^{(1)}$ plays the role of weight compression and feature extraction. The first layer is activated by the rectifier and the second layer is fully connected with softmax. The $\widehat{A}$, which is defined in [18], is the square matrix of size $N$ determined by the adjacency matrix of the input graph. This smoothing operation compensates the information lost in coarsening by taking a weighted average of features of each vertex and its neighbours. For vertices that are densely connected, it makes their features more similar and significantly improves the ease of node classification task [49].
VB Technical components
Fast computation for HANet. Complexity analysis of FHTs in the previous section shows that HANet is more efficient than GNNs with Fourier basis. The graph convolution of the latter will inccur $\mathcal{O}({N}^{3})$ computational steps. Various strategies are proposed to improve the computational performance for graph convolution. For example, ChebNet [17] and GCN [18] use localized polynomial approximation for the spectral filters; GWNN [21] constructs sparse and localized graph wavelet basis matrix for graph convolution. These methods implement the multiplication between a sparse matrix (e.g. the refined adjacency matrix $\widehat{A}$ in GCN or the wavelet basis matrix ${\psi}_{s}$ in GWNN [21]) and input matrix $F$ in the convolutional layer. However, to compute either $\widehat{A}F$ or ${\psi}_{s}F$, the computational complexity to a great extent relies on the sparse degree of $\widehat{A}$ or ${\psi}_{s}$, i.e. roughly proportional to $\mathcal{O}(\epsilon {N}^{2}d)$, where $\epsilon $, $\epsilon \in [0,1]$, represents the percentage of nonzero elements in a square matrix. The $\mathcal{O}(\epsilon {N}^{2}d)$ may be significantly higher than $\mathcal{O}(N{(\mathrm{log}N)}^{2}d)$ as long as $\epsilon $ is not extremely small, indicating that our FHTs usually outperform these methods especially when $N$ is quite large and $\epsilon \approx 1$. Moreover, sparse matrix multiplication or sparse FHTs as mentioned can be used to improve the computational speed of Haar convolution.
Chain. In HANet, the chain and the Haar basis can be precomputed since the graph structure is already known. In particular, the chain is computed by a modified version of the METIS algorithm [47], which fast generates a chain for the weight matrix of a graph. In general, the parents of a chain from METIS have at least two children, and then the weighted chain is a filtration and the condition of Proposition 2 is satisfied.
Weight sharing for filter. In the HANet one can use weight sharing given in Section IIIC for filters. By doing this, we exploit the local geometry of the graphstructured data to extract the common feature of neighbour nodes and meanwhile reduce the independent parameters of the filter. Weight sharing can take place in each convolutional layer of HANet. For chain ${\mathcal{G}}_{J\to {J}_{0}}$ with which the Haar basis is associated, weight sharing can act from the coarsest level ${J}_{0}$ to the finest level $J$ or from any level coarser than $J$ to $J$. For a filtration, the weight sharing could shrink the number of parameters by rate ${2}^{(J{J}_{0})}$ at least, see Figure 2b.
Graph pooling. We use max graph pooling between two convolutional layers of HANet. Each pooled input is the maximum of the previous layer whose locations are at the children of the current level. The pooling uses the same chain as the Haar basis at the same layer. For example, after pooling, the second layer uses the chain ${\mathcal{G}}_{(J1)\to {J}_{0}}$, as illustrated in Figure 2. By the construction of Haar basis in Section IIIB, the new Haar basis associated with ${\mathcal{G}}_{(J1)\to {J}_{0}}$ is exactly the precomputed basis ${\{{\varphi}_{\mathrm{\ell}}^{(J1)}\}}_{\mathrm{\ell}=1}^{{N}_{J1}}$.
VI Experiments
In this section, we test the proposed graph neural networks with Haar transforms on MNIST (graph signal classification), Quantum Chemistry (graph regression) and Citation Networks (node classification). The experiments for graph classification were carried out under the Google Colab environment with Tesla K80 GPU while for node classification were under the unix environment with a 3.3GHz Intel Core i7 CPU and 16GB RAM. All the methods were implemented in TensorFlow. SGD+Momentum and Adam optimization methods were used in the experiments.
VIA MNIST for graph signal classification.
We treat the MNIST digits classification as a learning problem on graphs, following [17, 30], where each pixel of an image with resolution $28\times 28$ is one of the $784$ vertices of the graph. Of $70,000$ images, we use $55,000$ for training, $5,000$ for validation and the remaining $10,000$ for test. We compare HANet against GNN with graph Laplacian [4] and ChebNet [17], all with LeNet5like architecture [50]. The edges are determined by the spatial relation between vertices. A vertex has an edge to the $8$ neighbours and has no edge to other vertices. The task is to recognize an image of handwritten digit as one of the ten digits $0,1,\mathrm{\dots},9$. We use similar hyperparameter setting from ChebNet^{1}^{1} 1 https://github.com/mdeff/cnn_graph: dropout probability $0.5$, regularization weight $2\times {10}^{4}$, initial learning rate $0.02$, learning rate decay $0.95$, momentum $0.9$. The chain for Haar basis is ${\mathcal{G}}_{5\to 0}$. The finest level ${\mathcal{G}}_{5}=\mathcal{G}$ has $784$ nodes and the graphs at the following levels have $218$, $64$, $18$, $6$ and $3$ nodes. For weight sharing and graph pooling we use the chain ${\mathcal{G}}_{(Jk+1)\to Jk}$ for the $k$th convolutional layer. The test accuracy of HANet for MNIST is $98.60\%$, which is close to $99.17\%$ of ChebNet and higher than $96.26\%$ of the GNN with graph Laplacian. This shows that HANet is able to learn complicated big graph data in graph signal classification where the Haar convolution preserves the geometric information of data.
VIB Quantum Chemistry for graph regression.
We test HANet on QM7 [51, 52]. The QM7 contains $7165$ molecules, each of which is represented by the Coulomb (energy) matrix and labeled with atomization energy. We treat each molecule as a weighted graph where the nodes are the atoms and the adjacency matrix is the $23\times 23$Coulomb matrix of the molecule, where the true number of atoms may be less than $23$. As in most cases the adjacency matrix is not full ranked, we take the average of the Coulomb matrices of all molecules as the common adjacency matrix, for which we generate the Haar basis. To avoid exploding gradients in parameter optimization, we take the standard score of each entry over all Coulomb matrices as input.
Method  Test MAE 

RF  $122.7\pm 4.2$ 
Multitask  $123.7\pm 15.6$ 
KRR  $110.3\pm 4.7$ 
GC  $77.9\pm 2.1$ 
Multitask(CM)  $10.8\pm 1.3$ 
KRR(CM)  $10.2\pm 0.3$ 
DTNN  $8.8\pm 3.5$ 
ANI1  $2.86\pm 0.25$ 
HANet  $\mathbf{9.50}\mathbf{\pm}\mathbf{0.71}$ 
The architecture of HANet contains $2$ layers of Haar convolution with $8$ and $2$ filters and then $2$ fully connected layers with $400$ and $100$ neurons. As the graph is not big, we do not use graph pooling or weight sharing. Following [53], we use mean squared error (MSE) plus regularization as the loss function in training and mean absolute error (MAE) as the test metric. We repeat the experiment over $5$ splits with the same proportion of training and test data but with different random seeds. We report the average performance and standard deviation of HANet in Table 1 compared against other public results [54] by methods Random Forest (RF) [55], Multitask Networks (Multitask) [56], Kernel Ridge Regression (KRR) [57], Graph Convolutional models (GC) [58], Deep Tensor Neural Network (DTNN) [59], ANI1 [60], KRR and Multitask with Coulomb Matrix featurization (KRR(CM)/Multitask(CM)) [54]. It shows that HANet ranks third in the list with average test MAE $9.50$ and average relative MAE $4.31\times {10}^{6}$, and thus offers a good approximator for QM7 regression.
Method  Citeseer  Cora  Pubmed 

MLP  $55.1$  $46.5$  $71.4$ 
ManiReg  $60.1$  $59.5$  $70.7$ 
SemiEmb  $59.6$  $59.0$  $71.1$ 
LP  $45.3$  $68.0$  $63.0$ 
DeepWalk  $43.2$  $67.2$  $65.3$ 
ICA  69.1  $75.1$  $73.9$ 
Planetoid  $64.7$  $75.7$  $77.2$ 
ChebNet  $69.8$  $81.2$  $74.4$ 
GCN  $70.3$  $81.5$  $79.0$ 
HANet  $\mathbf{70.1}$  $\mathbf{81.9}$  $\mathbf{79.3}$ 
Dataset  Haar basis size  Sparsity  Basis Generate Time (s)  Adjoint FHT Time (s)  Forward FHT Time (s) 

Citeseer  $3327$  $99.58\%$  $1.93509$  $0.05276$  $0.05450$ 
Cora  $2708$  $98.84\%$  $0.86429$  $0.06908$  $0.05515$ 
Pubmed  $19717$  $99.84\%$  $62.67185$  $1.08775$  $1.55694$ 
VIC Citation Networks for node classification.
We test the model (16) on citation networks Citeseer, Cora and Pubmed [61], following the experimental setup of [62, 18]. The Citeseer, Cora and Pubmed are $6$, $7$ and $3$ classification problems with nodes $3327$, $2708$ and $19717$, edges $4732$, $5429$ and $44338$, features $3703$, $1433$ and $500$, and label rates $0.036$, $0.052$ and $0.003$ respectively.
In Table II, we compare the performance of the model (16) of HANet with methods Multilayer Perceptron (MLP), Manifold Regularization (ManiReg) [36], Semisupervised Embedding (SemiEmb) [63], Traditional Label Propagation (LP) [64], DeepWalk [65], Linkbased Classification (ICA) [66], Planetoid [62], ChebNet [17] and GCN [18]. We repeat the experiment $10$ times with different random seeds and report the average test accuracy of HANet. HANet has the top test accuracies on Cora and Pubmed and ranks second on Citeseer. Figure 4 shows the mean and standard deviation of validation accuracies and the validation loss up to epoch $200$ of HANet and GCN. HANet achieves slightly higher max accuracy as well as smaller standard deviation, and the loss also converges faster than GCN.
Haar basis and FHTs. In Figure 3a, we show the matrix of the Haar basis vectors for Cora, which has sparsity (i.e. the proportion of zero entries) $98.84\%$. The associated chain ${\mathcal{G}}_{10\to 0}$ has $2708$, $928$, $352$, $172$, $83$, $41$, $20$, $10$, $5$, $2$, $1$ nodes from level $10$ to $0$. Figure 3b shows the comparison of time for FHTs with direct matrix product. It illustrates that FHTs have nearly linear computational cost while the cost of matrix product grows at $\mathcal{O}({N}^{3})$ for a graph of size $N$. Figure 3c shows the comparison of time for generating the Haar basis and the basis for graph Laplacian. The Haar basis needs much less time than that for graph Laplacian. Table III gives the sparsity and the CPU time for generating Haar basis and FHTs on three datasets. All sparsity values for three datasets are very high (around $99\%$), and the computational cost of FHTs is proportional to $N$.
VII Conclusion
We introduce Haar convolution for GNNs, which has a fast implementation in view of the sparsity of the Haar basis matrix. This reduces the computational cost of graph convolution to linear complexity. Haar basis gives a sparse representation of graph data, meanwhile, the geometric property of the data is preserved. Haar convolution can act as an effective alternative of the convolution by graph Laplacian for GNNs, as illustrated by extensive experimental study on benchmarks. The models for HANet can possibly be improved and generalized. For example, training of HANet can be accelerated using neighbour sampling [28], importance sampling [20], or variance reduction [19]. Application of Haar convolution in other architectures is also expected.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
 [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
 [4] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.
 [5] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graphstructured data,” arXiv preprint arXiv:1506.05163, 2015.
 [6] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
 [7] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
 [8] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. AspuruGuzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, 2015, pp. 2224–2232.
 [9] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International Conference on Machine Learning, 2016, pp. 2014–2023.
 [10] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. SanchezGonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
 [11] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
 [12] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
 [13] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” arXiv preprint arXiv:1812.04202, 2018.
 [14] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, “A review of relational machine learning for knowledge graphs,” Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, 2015.
 [15] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” KnowledgeBased Systems, vol. 151, pp. 78–94, 2018.
 [16] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” arXiv preprint arXiv:1709.05584, 2017.
 [17] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in NIPS, 2016, pp. 3844–3852.
 [18] T. N. Kipf and M. Welling, “Semisupervised classification with graph convolutional networks,” in ICLR, 2017.
 [19] J. Chen, J. Zhu, and L. Song, “Stochastic training of graph convolutional networks with variance reduction,” in ICML, 2018, pp. 941–949.
 [20] J. Chen, T. Ma, and C. Xiao, “FastGCN: fast learning with graph convolutional networks via importance sampling,” arXiv:1801.10247, 2018.
 [21] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” in International Conference on Learning Representations, 2019.
 [22] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
 [23] K. K. Thekumparampil, C. Wang, S. Oh, and L.J. Li, “Attentionbased graph neural network for semisupervised learning,” arXiv preprint arXiv:1803.03735, 2018.
 [24] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
 [25] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Accelerated filtering on graphs using lanczos method,” arXiv preprint arXiv:1509.04537, 2015.
 [26] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel, “Lanczosnet: Multiscale deep graph convolutional networks,” in International Conference on Learning Representations, 2019.
 [27] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” arXiv preprint arXiv:1902.07153, 2019.
 [28] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017, pp. 1024–1034.
 [29] J. Atwood and D. Towsley, “Diffusionconvolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
 [30] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model CNNs,” in CVPR, 2017, pp. 5425–5434.
 [31] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning. JMLR. org, 2017, pp. 1263–1272.
 [32] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
 [33] Z. Ma, M. Li, and Y. G. Wang, “Pan: Path integral based convolution for deep graph neural networks,” in Workshop on Learning and Reasoning with GraphStructured Representation. The 36th International Conference on Machine Learning, 2019.
 [34] A. Haar, “Zur theorie der orthogonalen funktionensysteme,” Mathematische Annalen, vol. 69, no. 3, pp. 331–371, 1910.
 [35] I. Daubechies, Ten lectures on wavelets. SIAM, 1992, vol. 61.
 [36] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, no. Nov, pp. 2399–2434, 2006.
 [37] C. Chui, F. Filbir, and H. Mhaskar, “Representation of functions on big data: graphs and trees,” Applied and Computational Harmonic Analysis, vol. 38, no. 3, pp. 489 – 509, 2015.
 [38] G. H. Golub and C. F. Van Loan, Matrix computations. JHU press, 2012, vol. 3.
 [39] P. Indyk and M. Kapralov, “Sampleoptimal Fourier sampling in any constant dimension,” in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE, 2014, pp. 514–523.
 [40] P. Indyk, M. Kapralov, and E. Price, “(nearly) sampleoptimal sparse fourier transform,” in Proceedings of the Twentyfifth Annual ACMSIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2014, pp. 480–499.
 [41] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Simple and practical algorithm for sparse fourier transform,” in Proceedings of the Twentythird Annual ACMSIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2012, pp. 1183–1194.
 [42] F. R. Chung and F. C. Graham, Spectral graph theory. American Mathematical Society, 1997, no. 92.
 [43] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
 [44] Y. G. Wang and X. Zhuang, “Tight framelets and fast framelet filter bank transforms on manifolds,” Applied and Computational Harmonic Analysis, 2018.
 [45] M. Gavish, B. Nadler, and R. R. Coifman, “Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi supervised learning,” in ICML, 2010, pp. 367–374.
 [46] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
 [47] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
 [48] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Nearly optimal sparse fourier transform,” in 44th Annual ACM Symposium on Theory of Computing, STOC’12, 2012, pp. 563–577.
 [49] Q. Li, Z. Han, and X.M. Wu, “Deeper insights into graph convolutional networks for semisupervised learning,” arXiv:1801.07606, 2018.
 [50] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [51] L. C. Blum and J.L. Reymond, “970 million druglike small molecules for virtual screening in the chemical universe database GDB13,” Journal of the American Chemical Society, vol. 131, p. 8732, 2009.
 [52] M. Rupp, A. Tkatchenko, K.R. Müller, and O. A. von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Physical Review Letters, vol. 108, p. 058301, 2012.
 [53] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” arXiv preprint arXiv:1704.01212, 2017.
 [54] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, “MoleculeNet: a benchmark for molecular machine learning,” Chemical Science, vol. 9, no. 2, pp. 513–530, 2018.
 [55] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
 [56] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, “Massively multitask networks for drug discovery,” arXiv preprint arXiv:1502.02072, 2015.
 [57] C. Cortes and V. Vapnik, “Supportvector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
 [58] H. AltaeTran, B. Ramsundar, A. S. Pappu, and V. Pande, “Low data drug discovery with oneshot learning,” ACS Central Science, vol. 3, no. 4, pp. 283–293, 2017.
 [59] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantumchemical insights from deep tensor neural networks,” Nature Communications, vol. 8, p. 13890, 2017.
 [60] J. S. Smith, O. Isayev, and A. E. Roitberg, “ANI1: an extensible neural network potential with DFT accuracy at force field computational cost,” Chemical Science, vol. 8, no. 4, pp. 3192–3203, 2017.
 [61] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. EliassiRad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, p. 93, 2008.
 [62] Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semisupervised learning with graph embeddings,” in ICML, 2016, pp. 40–48.
 [63] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semisupervised embedding,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
 [64] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semisupervised learning using gaussian fields and harmonic functions,” in ICML, 2003, pp. 912–919.
 [65] B. Perozzi, R. AlRfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in ACM SIGKDD. ACM, 2014, pp. 701–710.
 [66] Q. Lu and L. Getoor, “Linkbased classification,” in ICML, 2003, pp. 496–503.