Haar Transforms for Graph Neural Networks

  • 2019-07-10 15:22:37
  • Ming Li, Zheng Ma, Yu Guang Wang, Xiaosheng Zhuang
  • 5

Abstract

Graph Neural Networks (GNNs) have become a topic of intense research recentlydue to their powerful capability in high-dimensional classification andregression tasks for graph-structured data. However, as GNNs typically definethe graph convolution by the orthonormal basis for the graph Laplacian, theysuffer from high computational cost when the graph size is large. This paperintroduces the Haar basis, a sparse and localized orthonormal system for graph,constructed from a coarse-grained chain on the graph. The graph convolutionunder Haar basis --- the Haar convolution can be defined accordingly for GNNs.The sparsity and locality of the Haar basis allow Fast Haar Transforms (FHTs)on graph, by which a fast evaluation of Haar convolution between the graphsignals and the filters can be achieved. We conduct preliminary experiments onGNNs equipped with Haar convolution, which can obtain state-of-the-art resultsfor a variety of geometric deep learning tasks.

 

Quick Read (beta)

Haar Transforms for Graph Neural Networks

Ming Li, Zheng Ma, Yu Guang Wang, Xiaosheng Zhuang* M. Li is with the School of Information Technology in Education, South China Normal University, Guangzhou, China, and also with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC 3086, Australia. (e-mail: [email protected]).Z. Ma is with the Department of Physics, Princeton University, New Jersey, USA.(e-mail: [email protected]).Y.G. Wang is with the School of Mathematics and Statistics, The University of New South Wales, Sydney, Australia. (e-mail: [email protected]).X. Zhuang is with the Department of Mathematics, City University of Hong Kong, Hong Kong. (email: [email protected]).*  Corresponding author
Abstract

Graph Neural Networks (GNNs) have become a topic of intense research recently due to their powerful capability in high-dimensional classification and regression tasks for graph-structured data. However, as GNNs typically define the graph convolution by the orthonormal basis for the graph Laplacian, they suffer from high computational cost when the graph size is large. This paper introduces the Haar basis, a sparse and localized orthonormal system for graph, constructed from a coarse-grained chain on the graph. The graph convolution under Haar basis — the Haar convolution can be defined accordingly for GNNs. The sparsity and locality of the Haar basis allow Fast Haar Transforms (FHTs) on graph, by which a fast evaluation of Haar convolution between the graph signals and the filters can be achieved. We conduct preliminary experiments on GNNs equipped with Haar convolution, which can obtain state-of-the-art results for a variety of geometric deep learning tasks.

Graph Neural Networks, Haar basis, Graph Convolution, Geometric Deep Learning.

I Introduction

Convolutional neural networks (CNNs) have been a very successful machinery in many high-dimensional regression and classification tasks on Euclidean domains [1, 2]. Recently, its generalization to non-Euclidean domains, known as geometric deep learning, has attracted growing attention, due to its great potential in pattern recognition and regression for graph-structured data, see [3].

Graph neural networks (GNNs) are a typical model in geometric deep learning, which replaces the partial derivatives in CNNs by the Laplacian operator [4, 5].

(a) Haar basis matrix (b) Haar Convolution
Fig. 1: (a) The 8×8 matrix Φ of the Haar Basis for a graph 𝒢 with 8 nodes. The green entries are zero and the matrix Φ is sparse. The Haar basis is created based on the coarse-grained chain 𝒢20:=(𝒢2,𝒢1,𝒢0), where 𝒢2,𝒢1,𝒢0 are graphs with 8,4,2 nodes. For j=1,2, each node of 𝒢j-1 is a cluster of nodes in 𝒢j. Each column is a member of the Haar basis. The first two columns can be reduced as an orthonormal basis of 𝒢0 and the first to fourth columns can be reduced to the orthonormal basis for 𝒢1. (b) Haar Convolution gf using the Haar basis of (a), where the weight sharing for filter vector g is defined by the chain 𝒢20 and the gf is the forward Haar transform of the point-wise product of the adjoint Haar transform of g and f, where the Haar transforms have a fast algorithmic implementation.

The Laplacian, which carries the structural features of the data, is a second-order isotropic differential operator that admits a natural generalization to graphs and manifolds. In GNNs, input signals are convoluted with filters under an orthonormal system for the Laplacian. However, as the algebraic properties of regular Euclidean grids are lost in general manifolds and graphs, FFTs for the Laplacian are not available. This leads to the issue that the computation of convolution for graph signal is not always efficient, especially when the graph dataset is large.

In this paper, we introduce an alternative orthonormal system on graph, the Haar basis. It then defines a new graph convolution for GNNs — Haar convolution. Due to the sparsity and locality of the Haar basis, fast Haar transforms (FHTs) can be achieved on graph-structured data. This would significantly improve the computational efficiency of GNNs as the Haar convolution guarantees the linear computational complexity. We apply Haar convolution to GNNs and give a novel type of deep convolutional neural networks on graph — HANet. Numerical tests on real graph datasets show that HANet achieves good performance and computational efficiency in classification and regression tasks. The major contributions of the paper are as follows.

  • The Haar basis is introduced for graph. Both theoretical analysis and real examples of the sparsity and locality are given. With these properties, the fast algorithms for Haar transforms (FHTs) are proposed.

  • The Haar convolution under the Haar basis is given. By FHTs, the computational cost for the Haar convolution is proportional to the size of the data on the graph. Using Haar basis for Haar convolution, the comprehensive geometric information of the graph data can be well preserved.

  • GNN with Haar convolution (HANet) is proposed. We demonstrate experimentally that HANet provides great efficiency and achieves good performance on a broad range of high-dimensional regression and classification problems on graphs.

II Related Work

Graph Neural Networks. Developing deep neural networks for graph-structured data has received extensive attention in recent years [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Bruna et al. [4] first propose graph convolution, which is defined by graph Fourier transforms under the orthogonal basis of the graph Laplacian. The graph convolution uses Laplacian eigendecomposition which is computationally expensive. Defferrard et al. [17] approximate smooth filters in the spectral domain by Chebyshev polynomials. Kipf and Welling [18] simplify the convolutional layer by exploiting first-order Chebyshev polynomial for filters. Following this line, several acceleration methods for graph convolutional networks are proposed [19, 20]. Graph wavelet neural networks [21] replace graph Fourier transform by graph wavelet transform in the graph convolution, where Chebyshev polynomials are used to approximate the graph wavelet basis [22]. Although GWNN circumvents the Laplacian eigendecomposition, the matrix inner-product operations are nevertheless not avoidable in wavelet transforms for convolution computation. Graph convolutional networks with attention mechanisms [23, 24] can effectively learn the importance between nodes and its neighbors, which is more suitable for node classification task rather than graph regression ones. But much computational and memory cost is requested to perform the attention mechanism in the convolutional layers. Some GNN models [25, 26, 27] use multi-scale information and higher order adjacency matrix to define graph convolution. To increase the scalability of the model for large-scale graph, Hamilton et al. [28] propose the framework Graph-SAGE with sampling and a neural network based aggregator over a fixed size node neighbor. Diffusion convolutional neural networks [29] is developed by using diffusion operator for graph convolution. MoNet [30] introduces a general methodology to define spatial-based graph convolution by the weighted average of multiple weighting functions on neighborhood. [31] provides a unified framework, the Message Passing Neural Networks (MPNNs), by which some existing GNN models are incorporated. Xu et al. [32] present a theoretical analysis for the expressive power of GNNs and propose a simple but powerful variation of GNN, the graph isomorphism network. In [33], the authors propose the path integral based convolution for GNNs (PAN) which generalizes the graph Laplacian to maximal entropy transition matrix derived from a path integral formalism.

Haar Basis and Fast Algorithms. Haar basis rooted in the theory of Haar wavelet basis as first introduced by Haar [34], is a special case of Daubechies wavelets [35], and later developed onto graph by Belkin et al. [36], see also [37]. The fast computation for sparse matrix multiplication is well-studied (see e.g. [38]). The sparse Fourier transforms by which sparse Haar transforms can be developed are seen in [39, 40, 41].

III Graph convolution with Haar basis

III-A Graph Fourier transform

Bruna et al. [4] first define the graph convolution based on spectral graph theory [42]. An un-directed weighted graph 𝒢 is a triplet (V,E,w) with vertices V, edges E and weights w:E. Let l2(𝒢):={f:V|vV|f(v)|2<} the real-valued l2 space on the graph with inner product fg:=vVf(v)g(v). The (normalized) eigenvectors {u}=1|V| of the graph Laplacian forms an orthonormal basis for l2(𝒢). Denoted by N:=|V| the number of vertices of the graph and let U:=(u1,,uN) be the (Fourier) base matrix whose columns form the Fourier basis for l2(𝒢). The graph convolution is given by

gf=U((UTg)(UTf)), (1)

where UTf is the adjoint discrete Fourier transform of f, Uc is the forward discrete Fourier transform of c on 𝒢, is the element-wise Hadamard product.

While graph convolution defined in (1) is conceptually important, it has some limitations in practice. First, the base matrix U is obtained by using eigendecomposition of the graph Laplacian in the sense that =UΛUT, where Λ is the diagonal matrix of corresponding eigenvalues. The computational complexity is proportional to O(N3), which is impractical when the number of vertices of the graph is quite large. Second, the computation of the forward and inverse graph Fourier transforms (e.g. UTf and Uc) have O(N2) computational cost due to the multiplication by (dense) matrices U and UT. In general, there is no fast algorithms for the graph Fourier transforms. Third, filters in the spectral domain cannot guarantee the localization in the spatial (vertex) domain, and O(Ndm) parameters need to be tuned in the convolutional layer with m filters (hidden nodes) and d features for each vertex.

To alleviate the cost of computing the graph Fourier transform, Chebyshev polynomials [17] are used to construct localized polynomial filters for graph convolution, which is called ChebNet. Kipf and Welling [18] further simplify the ChebNet to obtain graph convolutional networks (GCNs). However, such a polynomial-based approximation strategy may lose some information in the spectral graph convolutional layer, and matrix multiplication is still not avoidable as FFTs is not available for graph convolution. Thus, the graph convolution in this scenario is also computationally expensive, especially for dense graph of large size. We propose an alternative orthonormal basis that allows fast computation for the corresponding graph convolution, which then improves the scalability and efficiency of existing graph models. The basis we use is the Haar basis on a graph. The Haar basis replaces the matrix of eigenvectors U in (1) and forms a highly sparse matrix, which reflects the clustering information of the graph. The sparsity of the Haar transform matrix allows fast computation (in nearly linear computational complexity) of the corresponding graph convolution.

III-B Haar basis

A basis for l2(𝒢) is a set of vectors {ϕ}=1N on 𝒢 which are linearly independent and orthogonal (i.e. ϕϕ=0 if ). The construction of the Haar basis exploits a chain of the graph. For a graph 𝒢=(V,E,w), a graph 𝒢cg:=(Vcg,Ecg,wcg) is called a coarse-grained graph of 𝒢 if |Vcg||V| and each vertex of 𝒢 has a parent in 𝒢cg. Each vertex of 𝒢cg is called a cluster of 𝒢. Let two integers J0,J, J>J0. A coarse-grained chain for 𝒢 is a set of graphs 𝒢J0J:=(𝒢J0,𝒢J0+1,,𝒢J) if 𝒢J=𝒢 and 𝒢j is a coarse-grained graph of 𝒢j-1 for j=J,J-1,,J0+1, and 𝒢J0 the top level or the coarsest level and 𝒢J the bottom level or the finest level. If the top level 𝒢J0 of the chain has only one node, 𝒢J0J becomes a tree. The chain 𝒢J0J gives a hierarchical partition for the graph 𝒢. For details about graphs and chains, refer to examples in [42, 43, 37, 44].

Construction of Haar basis. With a chain of the graph, one can generate a Haar basis for l2(𝒢) following [37], see also [45]. We show the construction of Haar basis on 𝒢, as follows.

Step 1. Let 𝒢cg=(Vcg,Ecg,wcg) be a coarse-grained graph of 𝒢. Each vertex vcgVcg is a cluster of 𝒢. We define Ncg vectors ϕcg on 𝒢cg by, for ucgVcg,

ϕ1cg(ucg)=𝟏(ucg)/Ncg, (2)

and for =2,,Ncg,

ϕcg(ucg) (3)
=Ncg-+1Ncg-+2(χ-1cg(ucg)-j=Ncgχjcg(ucg)Ncg-+1),

where Ncg is the number of vertices of 𝒢cg, and χjcg is the indicator function for the jth vertex ujcgVcg on 𝒢 given by

χjcg(vcg):={1,vcg=ujcg,0,vcg𝒢cg\{ujcg}.

Then, {ϕcg}=1Ncg forms an orthonormal basis for l2(𝒢cg). For each =1,,Ncg, we extend ϕcg to a vector on 𝒢 by

ϕ,1(v):=ϕcg(vcg)|vcg|,vvcg,=1,,Ncg,

here |vcg| means the number of vertices in 𝒢 whose common parent is vcg. For =1,,Ncg, ucg is a cluster of 𝒢 and can be viewed as a subset of V. We let k:=|ucg| be the number of vertices in the cluster ucg, for which we label the vertices by

ucg={v,1,,v,k}V.

For k=2,,k,

ϕ,k=k-k+1k-k+2(χ,k-1-j=kkχ,jk-k+1).

where χ,j for j=1,,k,

χ,j(v):={1,v=v,j,0,v𝒢\{v,j}.

The resulting {ϕ,k:=1,,Ncg,k=1,,k} is an orthonormal basis for l2(𝒢).

Step 2. Let 𝒢JJ0 be a coarse-grained chain for the graph 𝒢. An orthonormal basis {ϕ(0)}=1N0 for l2(𝒢J0) is generated using (2) and (3). We then repeatedly use Step 1: for j=J0+1,,J, we generate an orthonormal basis {ϕ(j)}=1Nj for l2(𝒢j) from the orthonormal basis {ϕ(j-1)}=1Nj-1 for the coarse-grained graph 𝒢j-1 that was derived in the previous steps. We call the function at the finest level ϕ:=ϕ(J), =1,,N the Haar global orthonormal basis or simply the Haar basis for 𝒢 associated with the chain 𝒢JJ0. The orthonormal basis {ϕ(j)}=1Nj for l2(𝒢j), j=J-1,J-2,,J0 is called the associated (orthonormal) basis for the Haar basis {ϕ}=1N.

Proposition 1.

For each level j=J0,,J, the {ϕ(j)}=1Nj is an orthonormal basis for l2(Gj), and in particular, {ϕ}=1N is an orthonormal basis for l2(G); each basis {ϕ(j)}=1Nj is the Haar basis for the chain GjJ0.

Proposition 2.

Let GJJ0 be a chain. If each parent of level Gj, j=J-1,J-2,,J0, contains at least two children, the number of different values of the Haar basis ϕ, =1,,N, is bounded by a constant.

The Haar basis depends on the chain for the graph. If the topology of the graph is well reflected by the clustering of the chain, the Haar basis then contains the crucial geometric information of the graph. For example, by using k-means clustering algorithm [46] or METIS algorithm [47] one can generate a chain that reveals desired geometric properties of the graph.

Figure 1b shows a chain 𝒢20 with 3 levels of a graph 𝒢. Here, for each level, the vertices are given by

V(2) =V={v1,,v8},
V(1) ={v1(1),v2(1),v3(1),v4(1)}
={{v1,v2},{v3,v4},{v5,v6},{v7,v8}},
V(0) ={v1(0),v2(0)}={{v1(1),v2(1)},{v3(1),v4(1)}}.

Figure 1a shows the Haar basis for the chain 𝒢20. There are in total 8 vectors of the Haar basis for 𝒢. From construction, the Haar basis ϕ and the associated basis ϕ(j), j=1,2 are closely connected: the ϕ1,ϕ2 can be reduced to ϕ1(2),ϕ2(2) and the ϕ1,ϕ2,ϕ3,ϕ4 can be reduced to ϕ1(1),ϕ2(1),ϕ3(1),ϕ4(1). This connection would allow fast algorithms for Haar transforms as given in Algorithms 1 and 2. In Figure 1, the matrix ΦT of the 8 Haar basis vectors ϕ on 𝒢 has good sparsity. With the increase of the graph size, the sparsity of the Haar basis matrix Φ becomes more prominent. We will later demonstrate this finding in our experimental study.

III-C Haar convolution

With the Haar basis constructed in Section III-B, we can define Haar convolution as an alternative form of spectral graph convolution in (1). Let {ϕ}=1N be the Haar basis associated with a chain 𝒢JJ0 of a graph 𝒢. Denoted by Φ=(ϕ1,,ϕN)RN×N the Haar transform matrix. We define by

ΦTf=(vVϕ1(v)f(v),,vVϕN(v)f(v)) (4)

the adjoint Haar transform for the signal f on 𝒢, and by

(Φc)(v)==1Nϕ(v)c,vV, (5)

the forward Haar transform for (coefficients) vector c:=(c1,,cN)N. We call the matrix Φ Haar transform matrix.

Definition 3.

The Haar convolution for a filter g and a signal f on G can be defined as

gf=Φ((ΦTg)(ΦTf)). (6)

Computationally, (6) is obtained by performing forward Haar transform of the element-wise Hadamard product between adjoint Haar transform of g and f. Compared with the Laplacian based spectral graph convolution given in (1), the Haar convolution has several characteristics: (i) the Haar transform matrix Φ is sparse so that the computation of ΦTf or Φc is more efficient than UTf or Uc; (ii) as the Haar basis is constructed based on the chain of the graph which reflects the clustering property for vertices, the Haar convolution can potentially extract abstract features (localized components) over f, which can be viewed as the learning representation for graph signal. From this perspective, Haar convolution might be more suitable for graph-level based modelling tasks (e.g. graph classification/regression) than node-level ones (e.g. semi-supervised node classification); (iii) by means of the sparsity of the Haar basis, adjoint and forward Haar transforms can be implemented by the fast algorithms that have nearly linear computational complexity (with respect to the size of f).

We can change the execution of ADHTs on the filter g (e.g. ΦTg) and regard the filter as defined in the frequency domain, and can thus refine Haar convolution as gf=Φ(g(ΦTf)).

Fast algorithms for Haar tranforms. The computation of Haar transforms can also be accelerated by using sparse matrix multiplications due to the sparsity of the Haar transform matrix. This would allow the linear computational complexity O(ϵN) with sparsity 1-ϵ of the Haar transform matrix. Moreover, a similar computational strategy to the sparse Fourier transforms [41, 48] can be used to have the Haar transforms achieve an even faster algorithm with complexity O(klogN) for graph with N nodes and Haar transform matrix with k non-zero elements. By the sparsity of Haar transform matrix, fast Haar transforms (FHTs) which includes adjoint Haar transform and forward Haar transform can be developed to speed up the implementation of Haar convolution. Theorems 4 and 5 in the following section show that the computational cost of adjoint and forward Haar transform can reach 𝒪(N) and 𝒪(N(logN)2). They are nearly linear computational complexity and are then called fast Haar transforms (FHTs). Meanwhile, the Haar convolution can be evaluated by FHTs in 𝒪(N(logN)2) steps.

Weight sharing. We can use weight sharing in Haar convolution to reduce the number of parameters of the filter, and capture the common feature of the nodes which are in the same cluster. As the clustering contains the information of the neighbourhood, we can use the chain 𝒢JJ0 for weight sharing: the vertices of the graph which have the same parent at a coarser level share a parameter of the filter. Here, the coarser level is some fixed level J1, J0J1<J. For example, the weight sharing rule for chain 𝒢20 in Figure 1b is: assign the weight gi for each node vi(0), i=1,2 on the top level, the filter (or the weight vector) at the bottom level is then g=(g1,g1,g1,g1,g2,g2,g2,g2). In this way, we can use the filter g with two independent parameters g1,g2 to convolute with the input vector with 8 components.

IV Fast algorithms under Haar basis

For the Haar convolution introduced in Definition 6, we can develop an efficient computational strategy by virtue of the sparsity of the Haar basis, as mentioned. Let 𝒢JJ0 be a chain of the graph 𝒢. For convenience, we label the vertices of the level-j graph 𝒢j by Vj:={v1(j),,vNj(j)}.

Fast computation for adjoint Haar transform ΦTf. The adjoint Haar transform in (4) can be computed in the following way. For j=J0,,J-1, let ck(j) be the number of children of vk(j), i.e. the number of vertices of 𝒢j+1 which belongs to the cluster vk(j), for k=1,,Nj. For j=J, let ck(J)1 for k=1,,N. For j=J0,,J and k=1,,Nj, we define the weight factor for vk(j) by

wk(j):=1/ck(j). (7)

Let WJJ0:={wk(j):j=J0,,J,k=1,,Nj}. Then, the weighted chain (𝒢JJ0,WJJ0) is a filtration if each parent in the chain 𝒢JJ0 has at least two children. See e.g. [37, Definition 2.3].

Let {ϕ}=1N be the Haar basis for the filtration (𝒢JJ0,WJJ0) of a graph 𝒢. We define the weighted sum for fl2(𝒢) by

𝒮(J)(f,vk(J)):=f(vk(J)),vk(J)𝒢J, (8)

and for j=J0,,J-1 and vk(j)𝒢j,

𝒮(j)(f,vk(j)):=vk(j+1)vk(j)wk(j+1)𝒮(j+1)(f,vk(j+1)). (9)

For each vertex vk(j) of 𝒢j, the 𝒮(j)(f,vk(j)) is the weighted sum of the 𝒮(j+1)(f,vk(j+1)) at the level j+1 for those vertices vk(j+1) of 𝒢j+1 whose parent is vk(j).

The adjoint Haar transform can be evaluated by the following theorem.

Theorem 4.

Let {ϕ}=1N be the Haar basis for the filtration (GJJ0,WJJ0) of a graph G. Then, the adjoint Haar transform for the vector f on the graph G can be computed by, for =1,,N,

(ΦTf)=k=1Nj𝒮(j)(f,vk(j))wk(j)ϕ(j)(vk(j)), (10)

where ϕ(j) is the th member of the orthonormal basis for l2(Gj) associated with the Haar basis ϕ (see Section III-B), vk(j) are the vertices of Gj and the weight factors wk(j) are given by (7).

Proof.

By the relation between ϕ and ϕ(j),

(ΦTf) =k=1Nf(vk(J))ϕ(vk(J))
=k=1NJ-1(vk(J)vk(J-1)f(vk(J)))wk(J-1)ϕ(J-1)(vk(J-1))
=k=1NJ-1𝒮(J-1)(f,vk(J-1))wk(J-1)ϕ(J-1)(vk(J-1))
=k′′=1NJ-2(vk(J-1)vk′′(J-2)𝒮(J-1)(f,vk(J-1))wk(J-1))
×wk′′(J-2)ϕ(J-2)(vk′′(J-2))
=k′′=1NJ-2𝒮(J-2)(f,vk′′(J-2))wk′′(J-2)ϕ(J-2)(vk′′(J-2))
=k=1Nj𝒮(j)(f,vk(j))wk(j)ϕ(j)(vk(j)),

thus completing the proof. ∎

Fast computation for forward Haar transform Φc. The forward Haar transform in (5) can be computed, as follows.

Theorem 5.

Let {ϕ}=1N be the Haar basis for a filtration (GJJ0,WJJ0) of the graph G and {ϕ(j)}=1Nj, j=J0,,J is the associated bases at Gj. Then, the forward Haar transform for vector c=(c1,,cN) in RN can be computed by, for k=1,,N,

(Φc)k=j=1JWk(j)(=Nj-1+1Njcϕ(j)(vkj(j))),

where for k=1,,N, vkj(j) is the parent of vk(J) at level j, and Wk(J):=1 and

Wk(j):=n=2jwkn(n) for j=J0,,J-1, (11)

where the weight factors wkn(n) for n=1,,J are given by (7).

Proof.

Let Nj:=|Vj| for j=J0,,J and NJ0-1:=0. For k=1,,NJ, let vk(J) the kth vertex of 𝒢J. For i=J0,,J-1, there exists ki=1,,Nj such that vki(i) the parent at level i of vk(J). By the property of the Haar basis, for each eigenvector ϕ there exists j=J0,,J such that =Nj-1+1,,Nj, ϕ is a constant for the vertices of 𝒢J=𝒢 who have the same parent at level j. Then,

ϕ(vk(J)) =wkJ-1(J-1)ϕ(J-1)(vkJ-1(J-1))
=wkJ-1(J-1)wkJ-2(J-2)ϕ(J-2)(vkJ-2(J-2))
=(n=J0jwkn(n))ϕ(j)(vkj(j))
=Wk(j)ϕ(j)(vkj(j)). (12)

where the product of the weights in the third equality only depends upon the level j and the vertex vk(1), and we have let

Wk(j):=n=1jwkn(n).

By (IV),

Φ(c,vk(J)) ==1Ncϕ(vk(J))
=j=J0J=Nj-1+1Njcϕ(vk(J))
=j=J0J=Nj-1+1NjcWk(j)ϕ(j)(vkj(j))
=j=J0JWk(j)(=Nj-1+1Njcϕ(j)(vkj(j))),

thus completing the proof. ∎

\[email protected]@algorithmic \STATEInput: A real-valued vector f=(f1,,fN) on the graph 𝒢; the Haar basis {ϕ}=1N for l2(𝒢) with the chain 𝒢JJ0 and the associated basis {ϕ(j)}=1Nj for l2(𝒢j). \STATEOutput: ADFT ΦTf in (4) under the basis {ϕ}=1N.
  1. 1.

    Evaluation of the following sums for j=J0,,J-1, using (8) and (9) recursively.

    𝒮(j)(f,vk(j)),vk(j)Vj. (13)
  2. 2.

    For each , let j be the integer such that Nj-1+1Nj, where NJ0-1:=0. Evaluating

[1mm] k=1Nj𝒮(j)(f,vk(j))wk(j)ϕ(j)(vk(j))
[1.5mm] in (10) by the following two steps.
[2mm] (a) Compute the product for all vk(j)Vj:
[1mm] T(f,vk(j))=𝒮(j)(f,vk(j))wk(j)ϕ(j)(vk(j)). (b) Evaluate sum k=1NjT(f,vk(j)).
\algorithmcfname 1 Fast Haar Transforms: Adjoint
\[email protected]@algorithmic \STATEInput: A real-valued vector c=(c1,,cN) on graph 𝒢; the Haar basis {ϕ}=1N for l2(𝒢) associated with the chain 𝒢JJ0 and the associated orthonormal basis {ϕ(j)}=1Nj for l2(𝒢j). \STATEOutput: DFT Φc in (5) under the basis {ϕ}=1N.
  1. 1.

    For each , let j be the integer such that Nj-1+1Nj, where NJ0-1:=0. For all k=1,,Nj, compute the product

[1mm] t(c,vk(j)):=cϕ(j)(vk(j)).
  • 2.

    For each j=J0,,J, evaluate the sums

  • [1mm] s(c,vkj(j)):==Nj-1+1Njt(c,vkj(j)).
  • 3.

    Compute the Wk(j) for k=1,,N and j=J0,,J-1 by (11).

  • 4.

    Compute the weighted sum

  • [1mm] (Φc)k=j=J0JWk(j)s(c,vkj(j)),k=1,,N.
    \algorithmcfname 2 forward Haar transform (DFTs)

    Computational Complexity Analysis. Algorithm 1 gives the computational steps for the evaluation of (ΦTf), =1,,N in Theorem 4. In the first step of Algorithm 1, the total number of summations to compute all elements of (13) is no more than i=0j-1Ni+1; In the second step, the total number of multiplication and summation operations is at most 2=1NC=𝒪(N). Here C is the constant which bounds the number of distinct values of the Haar basis (see Proposition 2). Thus, the computational steps of Algorithm 1 are 𝒪(N).

    By Theorem 5, the evaluation of the forward Haar transform Φc can be implemented by Algorithm 2. In the first step of Algorithm 2, the number of multiplications is no more than =1NC=𝒪(N); in the second step, the number of summations is no more than =1NC=𝒪(N) as well; in the third step, the computational steps are 𝒪(N(logN)2); in the last step, the total number of summations and multiplications is 𝒪(NlogN). Thus, the total computational steps are 𝒪(N(logN)2).

    Hence, Algorithms 1 and 2 have linear computational cost (up to a logN term). We call these two algorithms fast Haar transforms (FHTs) under Haar basis on the graph.

    Proposition 6.

    The ADFTs and DFTs in Algorithms 1 and 2 are invertible in that for any vector f on graph G,

    f=Φ(ΦTf).

    Proposition 6 shows that DFTs can recover the graph signal f from the forward Haar transform ΦTf. This means that forward and adjoint Haar transforms have zero-loss in graph signal transmission.

    The Haar convolution can be evaluated fast by FHTs in Algorithms 1 and 2. We show the computational steps for Haar convolution in Algorithm 3. From the above discussion, the computational complexity of Algorithm 3 is 𝒪(N(logN)2). We call Algorithm 3 the fast Haar convolution (algorithm). This can be used in graph neural networks which we introduce now.

    \[email protected]@algorithmic \STATEInput: Real-valued vectors g:=(g1,,gN) and f=(f1,,fN) on 𝒢; chain 𝒢J0J of graph 𝒢 where 𝒢J:=𝒢. \STATEOutput: Haar convolution gf of g and f as given by Definition 6.
    1. Compute ADFTs ΦTg and ΦTf by Algorithm 1.
    2. Compute the pointwise product of ΦTg and ΦTf.
    3. Compute DFT of (ΦTg)(ΦTf) by Algorithm 2.
    \algorithmcfname 3 Fast Haar Convolution

    V Graph neural networks with Haar transforms

    V-A Models

    The Haar convolution of (1) can be applied to all architectures of graph neural networks. For graph classification and regression tasks, we can apply the model with convolutional layer consisting of m-hidden neutrons and a non-linear activation function σ (e.g. ReLU): for i=1,2,m,

    fiout =σ(j=1dΦ(gi,j(ΦTfjin)))
    =σ(j=1dΦGi,jΦTfjin), (14)

    for input graph data Fin=(f1in,f2in,,fdin)RN×d with N nodes and d input features (for each vertex). Here, the feature fjin of the input signal is convolved with the learnable filter gi,jN by Haar transforms, and then all Haar-transformed features are combined as a new feature fiout. This gives the output matrix Fout=(f1out,f2out,,fmout)N×m. If we write Gi,jN×N as the diagonal matrix of filter gi,j, the convolutional layer has the compact form of the second equality in (V-A). We call the GNNs with Haar convolution in (V-A) HANet.

    Weight detaching. For each layer, O(Ndm) parameters need to be tuned. To reduce the number of parameters, we can replace the filter matrix Gi,j by a unified diagonal filter matrix G and a compression matrix WRd×m (which is a detaching approach used in conventional CNN for extracting features). This then leads to a concise form

    F^=σ(Φ(G(ΦTF))W). (15)

    Then, it requires O(N+dm) parameters to train. Recall that constructing the Haar basis uses a chain 𝒢JJ0 for the graph 𝒢, one can implement weight sharing based on the same chain structure. Specifically, k-means clustering algorithm [46] or METIS algorithm [47] can be used to generate a chain that reveals desired geometric properties of the graph. Suppose we consider a coarser level J1 (J0J1<J) having K clusters, then all vertices in the same cluster share the common filter parameter. Similarly, the corresponding children vertices in level J1-1 share the same filter parameters as used in their parent vertices, until the bottom level corresponds to the whole set of vertices of the input graph. Thus, the parameter complexity is reduced to 𝒪(K+dm).

    The HANet can be evaluated by performing d-times fast Haar convolutions (consisting of d-times adjoint and forward Haar transforms). The total computational cost is 𝒪(N(logN)2d). Deep GNNs with Haar convolution can be built by stacking multiple Haar convolutional layers of (15), followed by an output layer.


    (a) HANet architecture (b) Weight sharing and graph pooling
    Fig. 2: (a) HANet with multiple Haar convolutional layers and then fully connected by softmax. (b) Weight sharing for Haar convolution and graph coarsening for graph pooling for the chain 𝒢20.

    HANet for graph classification and regression. Graph classification and regression can be formulated as supervised learning. Given a collection of graph-structured data {fi}i=1n with labels {yi}i=1n, the objective of the classification task is to find a mapping that can classify or predict labels. The model of HANet uses a similar architecture of deep convolutional neural network which has several Haar convolutional layers and fully connected dense layers. Figure 2a shows the flowchat of HANet with multiple Haar convolutional layers: the chain 𝒢JJ0 and the Haar basis ϕ and the associated basis ϕ(j), j=J0,,J are pre-computed; graph-structured input f is Haar-convoluted with filter g which is of length N but with NJ-1 independent parameters, where g is expanded from level J-1 to J by weight sharing, and the output fout of the first layer is the ReLU of the Haar convolution of g and f; the graph pooling reduces fout of size NJ to f~in of size NJ-1; and in the second Haar convolutional layer, the input is f~in and the Haar basis is ϕ(J-1); the following layers continue this process; the final Haar convolutional layer is fully connected by one or multiple dense layers. For classification, the softmax function is applied to the last dense layer.

    HANet for node classification. In node classification, the whole graph is the only single input, where a small proportion of nodes are labeled. The output is the graph with all unknown labels predicted. We can use the network with two layers as

    HANet(fin):=softmax(HC(2)(ReLU(HC(1)(fin)))) (16)

    where HC(1) and HC(2) are the Haar convolutional layers

    HC(i)(f):=A^(w1(i)f)w2(i),i=1,2,

    where we use the modified Haar convolution w1(i)f=Φ(w1(i)(ΦTf)). For a graph with N nodes and M features, in the first Haar convolutional layer, the filter w1(1) contains N0×M parameters and is extended to a matrix N×M by weight sharing, where N0 is the number of nodes at the coarsest level. The w2(1) plays the role of weight compression and feature extraction. The first layer is activated by the rectifier and the second layer is fully connected with softmax. The A^, which is defined in [18], is the square matrix of size N determined by the adjacency matrix of the input graph. This smoothing operation compensates the information lost in coarsening by taking a weighted average of features of each vertex and its neighbours. For vertices that are densely connected, it makes their features more similar and significantly improves the ease of node classification task [49].

    V-B Technical components

    Fast computation for HANet. Complexity analysis of FHTs in the previous section shows that HANet is more efficient than GNNs with Fourier basis. The graph convolution of the latter will inccur 𝒪(N3) computational steps. Various strategies are proposed to improve the computational performance for graph convolution. For example, ChebNet [17] and GCN [18] use localized polynomial approximation for the spectral filters; GWNN [21] constructs sparse and localized graph wavelet basis matrix for graph convolution. These methods implement the multiplication between a sparse matrix (e.g. the refined adjacency matrix A^ in GCN or the wavelet basis matrix ψs in GWNN [21]) and input matrix F in the convolutional layer. However, to compute either A^F or ψsF, the computational complexity to a great extent relies on the sparse degree of A^ or ψs, i.e. roughly proportional to 𝒪(εN2d), where ε, ε[0,1], represents the percentage of non-zero elements in a square matrix. The 𝒪(εN2d) may be significantly higher than 𝒪(N(logN)2d) as long as ε is not extremely small, indicating that our FHTs usually outperform these methods especially when N is quite large and ε1. Moreover, sparse matrix multiplication or sparse FHTs as mentioned can be used to improve the computational speed of Haar convolution.

    Chain. In HANet, the chain and the Haar basis can be pre-computed since the graph structure is already known. In particular, the chain is computed by a modified version of the METIS algorithm [47], which fast generates a chain for the weight matrix of a graph. In general, the parents of a chain from METIS have at least two children, and then the weighted chain is a filtration and the condition of Proposition 2 is satisfied.

    Weight sharing for filter. In the HANet one can use weight sharing given in Section III-C for filters. By doing this, we exploit the local geometry of the graph-structured data to extract the common feature of neighbour nodes and meanwhile reduce the independent parameters of the filter. Weight sharing can take place in each convolutional layer of HANet. For chain 𝒢JJ0 with which the Haar basis is associated, weight sharing can act from the coarsest level J0 to the finest level J or from any level coarser than J to J. For a filtration, the weight sharing could shrink the number of parameters by rate 2-(J-J0) at least, see Figure 2b.

    Graph pooling. We use max graph pooling between two convolutional layers of HANet. Each pooled input is the maximum of the previous layer whose locations are at the children of the current level. The pooling uses the same chain as the Haar basis at the same layer. For example, after pooling, the second layer uses the chain 𝒢(J-1)J0, as illustrated in Figure 2. By the construction of Haar basis in Section III-B, the new Haar basis associated with 𝒢(J-1)J0 is exactly the pre-computed basis {ϕ(J-1)}=1NJ-1.

    VI Experiments

    In this section, we test the proposed graph neural networks with Haar transforms on MNIST (graph signal classification), Quantum Chemistry (graph regression) and Citation Networks (node classification). The experiments for graph classification were carried out under the Google Colab environment with Tesla K80 GPU while for node classification were under the unix environment with a 3.3GHz Intel Core i7 CPU and 16GB RAM. All the methods were implemented in TensorFlow. SGD+Momentum and Adam optimization methods were used in the experiments.

    VI-A MNIST for graph signal classification.

    We treat the MNIST digits classification as a learning problem on graphs, following [17, 30], where each pixel of an image with resolution 28×28 is one of the 784 vertices of the graph. Of 70,000 images, we use 55,000 for training, 5,000 for validation and the remaining 10,000 for test. We compare HANet against GNN with graph Laplacian [4] and ChebNet [17], all with LeNet-5-like architecture [50]. The edges are determined by the spatial relation between vertices. A vertex has an edge to the 8 neighbours and has no edge to other vertices. The task is to recognize an image of hand-written digit as one of the ten digits 0,1,,9. We use similar hyper-parameter setting from ChebNet11 1 https://github.com/mdeff/cnn_graph: dropout probability 0.5, regularization weight 2×10-4, initial learning rate 0.02, learning rate decay 0.95, momentum 0.9. The chain for Haar basis is 𝒢50. The finest level 𝒢5=𝒢 has 784 nodes and the graphs at the following levels have 218, 64, 18, 6 and 3 nodes. For weight sharing and graph pooling we use the chain 𝒢(J-k+1)J-k for the kth convolutional layer. The test accuracy of HANet for MNIST is 98.60%, which is close to 99.17% of ChebNet and higher than 96.26% of the GNN with graph Laplacian. This shows that HANet is able to learn complicated big graph data in graph signal classification where the Haar convolution preserves the geometric information of data.

    VI-B Quantum Chemistry for graph regression.

    We test HANet on QM7 [51, 52]. The QM7 contains 7165 molecules, each of which is represented by the Coulomb (energy) matrix and labeled with atomization energy. We treat each molecule as a weighted graph where the nodes are the atoms and the adjacency matrix is the 23×23-Coulomb matrix of the molecule, where the true number of atoms may be less than 23. As in most cases the adjacency matrix is not full ranked, we take the average of the Coulomb matrices of all molecules as the common adjacency matrix, for which we generate the Haar basis. To avoid exploding gradients in parameter optimization, we take the standard score of each entry over all Coulomb matrices as input.

    TABLE I: Test mean absolute error (MAE) comparison on QM7
    Method Test MAE
    RF 122.7±4.2
    Multitask 123.7±15.6
    KRR 110.3±4.7
    GC 77.9±2.1
    Multitask(CM) 10.8±1.3
    KRR(CM) 10.2±0.3
    DTNN 8.8±3.5
    ANI-1 2.86±0.25
    HANet 9.50±0.71

    The architecture of HANet contains 2 layers of Haar convolution with 8 and 2 filters and then 2 fully connected layers with 400 and 100 neurons. As the graph is not big, we do not use graph pooling or weight sharing. Following [53], we use mean squared error (MSE) plus regularization as the loss function in training and mean absolute error (MAE) as the test metric. We repeat the experiment over 5 splits with the same proportion of training and test data but with different random seeds. We report the average performance and standard deviation of HANet in Table 1 compared against other public results [54] by methods Random Forest (RF) [55], Multitask Networks (Multitask) [56], Kernel Ridge Regression (KRR) [57], Graph Convolutional models (GC) [58], Deep Tensor Neural Network (DTNN) [59], ANI-1 [60], KRR and Multitask with Coulomb Matrix featurization (KRR(CM)/Multitask(CM)) [54]. It shows that HANet ranks third in the list with average test MAE 9.50 and average relative MAE 4.31×10-6, and thus offers a good approximator for QM7 regression.

    TABLE II: Test accuracy comparison on citation networks
    Method Citeseer Cora Pubmed
    MLP 55.1 46.5 71.4
    ManiReg 60.1 59.5 70.7
    SemiEmb 59.6 59.0 71.1
    LP 45.3 68.0 63.0
    DeepWalk 43.2 67.2 65.3
    ICA 69.1 75.1 73.9
    Planetoid 64.7 75.7 77.2
    ChebNet 69.8 81.2 74.4
    GCN 70.3 81.5 79.0
    HANet 70.1 81.9 79.3
    (a) Haar basis matrix Φ (b) CPU time of FFTs (c) CPU time of generating basis
    Fig. 3: (a) Haar basis Φ for Cora with a chain of 11 levels (by METIS). Each column is a vector of the Haar basis. The 98.84% entries of the matrix are zeros. (b) Comparison of time for FHTs and Direct Matrix Product for the Haar basis for graphs with nodes 5,000. (c) Comparison of time for generating the orthonormal bases for Haar and graph Laplacian on graphs with nodes 2,000.
    TABLE III: Sparsity of Haar basis and time of FHTs for citation networks in CPU
    Dataset Haar basis size Sparsity Basis Generate Time (s) Adjoint FHT Time (s) Forward FHT Time (s)
    Citeseer 3327 99.58% 1.93509 0.05276 0.05450
    Cora 2708 98.84% 0.86429 0.06908 0.05515
    Pubmed 19717 99.84% 62.67185 1.08775 1.55694

    VI-C Citation Networks for node classification.

    We test the model (16) on citation networks Citeseer, Cora and Pubmed [61], following the experimental setup of [62, 18]. The Citeseer, Cora and Pubmed are 6, 7 and 3 classification problems with nodes 3327, 2708 and 19717, edges 4732, 5429 and 44338, features 3703, 1433 and 500, and label rates 0.036, 0.052 and 0.003 respectively.

    In Table II, we compare the performance of the model (16) of HANet with methods Multilayer Perceptron (MLP), Manifold Regularization (ManiReg) [36], Semi-supervised Embedding (SemiEmb) [63], Traditional Label Propagation (LP) [64], DeepWalk [65], Link-based Classification (ICA) [66], Planetoid [62], ChebNet [17] and GCN [18]. We repeat the experiment 10 times with different random seeds and report the average test accuracy of HANet. HANet has the top test accuracies on Cora and Pubmed and ranks second on Citeseer. Figure 4 shows the mean and standard deviation of validation accuracies and the validation loss up to epoch 200 of HANet and GCN. HANet achieves slightly higher max accuracy as well as smaller standard deviation, and the loss also converges faster than GCN.

    Fig. 4: Main figure: Mean and standard deviation of validation accuracies of HANet and GCN on Cora with epoch 200. Figure in lower right corner: Validation loss function of HANet and GCN.

    Haar basis and FHTs. In Figure 3a, we show the matrix of the Haar basis vectors for Cora, which has sparsity (i.e. the proportion of zero entries) 98.84%. The associated chain 𝒢100 has 2708, 928, 352, 172, 83, 41, 20, 10, 5, 2, 1 nodes from level 10 to 0. Figure 3b shows the comparison of time for FHTs with direct matrix product. It illustrates that FHTs have nearly linear computational cost while the cost of matrix product grows at 𝒪(N3) for a graph of size N. Figure 3c shows the comparison of time for generating the Haar basis and the basis for graph Laplacian. The Haar basis needs much less time than that for graph Laplacian. Table III gives the sparsity and the CPU time for generating Haar basis and FHTs on three datasets. All sparsity values for three datasets are very high (around 99%), and the computational cost of FHTs is proportional to N.

    VII Conclusion

    We introduce Haar convolution for GNNs, which has a fast implementation in view of the sparsity of the Haar basis matrix. This reduces the computational cost of graph convolution to linear complexity. Haar basis gives a sparse representation of graph data, meanwhile, the geometric property of the data is preserved. Haar convolution can act as an effective alternative of the convolution by graph Laplacian for GNNs, as illustrated by extensive experimental study on benchmarks. The models for HANet can possibly be improved and generalized. For example, training of HANet can be accelerated using neighbour sampling [28], importance sampling [20], or variance reduction [19]. Application of Haar convolution in other architectures is also expected.

    References

    • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
    • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
    • [3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
    • [4] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.
    • [5] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
    • [6] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
    • [7] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
    • [8] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, 2015, pp. 2224–2232.
    • [9] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International Conference on Machine Learning, 2016, pp. 2014–2023.
    • [10] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
    • [11] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
    • [12] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
    • [13] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” arXiv preprint arXiv:1812.04202, 2018.
    • [14] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, “A review of relational machine learning for knowledge graphs,” Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, 2015.
    • [15] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
    • [16] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” arXiv preprint arXiv:1709.05584, 2017.
    • [17] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in NIPS, 2016, pp. 3844–3852.
    • [18] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
    • [19] J. Chen, J. Zhu, and L. Song, “Stochastic training of graph convolutional networks with variance reduction,” in ICML, 2018, pp. 941–949.
    • [20] J. Chen, T. Ma, and C. Xiao, “FastGCN: fast learning with graph convolutional networks via importance sampling,” arXiv:1801.10247, 2018.
    • [21] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” in International Conference on Learning Representations, 2019.
    • [22] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
    • [23] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li, “Attention-based graph neural network for semi-supervised learning,” arXiv preprint arXiv:1803.03735, 2018.
    • [24] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
    • [25] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst, “Accelerated filtering on graphs using lanczos method,” arXiv preprint arXiv:1509.04537, 2015.
    • [26] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel, “Lanczosnet: Multi-scale deep graph convolutional networks,” in International Conference on Learning Representations, 2019.
    • [27] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” arXiv preprint arXiv:1902.07153, 2019.
    • [28] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017, pp. 1024–1034.
    • [29] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1993–2001.
    • [30] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model CNNs,” in CVPR, 2017, pp. 5425–5434.
    • [31] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning.   JMLR. org, 2017, pp. 1263–1272.
    • [32] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
    • [33] Z. Ma, M. Li, and Y. G. Wang, “Pan: Path integral based convolution for deep graph neural networks,” in Workshop on Learning and Reasoning with Graph-Structured Representation. The 36th International Conference on Machine Learning, 2019.
    • [34] A. Haar, “Zur theorie der orthogonalen funktionensysteme,” Mathematische Annalen, vol. 69, no. 3, pp. 331–371, 1910.
    • [35] I. Daubechies, Ten lectures on wavelets.   SIAM, 1992, vol. 61.
    • [36] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, no. Nov, pp. 2399–2434, 2006.
    • [37] C. Chui, F. Filbir, and H. Mhaskar, “Representation of functions on big data: graphs and trees,” Applied and Computational Harmonic Analysis, vol. 38, no. 3, pp. 489 – 509, 2015.
    • [38] G. H. Golub and C. F. Van Loan, Matrix computations.   JHU press, 2012, vol. 3.
    • [39] P. Indyk and M. Kapralov, “Sample-optimal Fourier sampling in any constant dimension,” in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.   IEEE, 2014, pp. 514–523.
    • [40] P. Indyk, M. Kapralov, and E. Price, “(nearly) sample-optimal sparse fourier transform,” in Proceedings of the Twenty-fifth Annual ACM-SIAM Symposium on Discrete Algorithms.   Society for Industrial and Applied Mathematics, 2014, pp. 480–499.
    • [41] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Simple and practical algorithm for sparse fourier transform,” in Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms.   Society for Industrial and Applied Mathematics, 2012, pp. 1183–1194.
    • [42] F. R. Chung and F. C. Graham, Spectral graph theory.   American Mathematical Society, 1997, no. 92.
    • [43] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
    • [44] Y. G. Wang and X. Zhuang, “Tight framelets and fast framelet filter bank transforms on manifolds,” Applied and Computational Harmonic Analysis, 2018.
    • [45] M. Gavish, B. Nadler, and R. R. Coifman, “Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi supervised learning,” in ICML, 2010, pp. 367–374.
    • [46] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
    • [47] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
    • [48] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Nearly optimal sparse fourier transform,” in 44th Annual ACM Symposium on Theory of Computing, STOC’12, 2012, pp. 563–577.
    • [49] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” arXiv:1801.07606, 2018.
    • [50] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
    • [51] L. C. Blum and J.-L. Reymond, “970 million druglike small molecules for virtual screening in the chemical universe database GDB-13,” Journal of the American Chemical Society, vol. 131, p. 8732, 2009.
    • [52] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Physical Review Letters, vol. 108, p. 058301, 2012.
    • [53] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” arXiv preprint arXiv:1704.01212, 2017.
    • [54] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, “MoleculeNet: a benchmark for molecular machine learning,” Chemical Science, vol. 9, no. 2, pp. 513–530, 2018.
    • [55] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
    • [56] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, “Massively multitask networks for drug discovery,” arXiv preprint arXiv:1502.02072, 2015.
    • [57] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
    • [58] H. Altae-Tran, B. Ramsundar, A. S. Pappu, and V. Pande, “Low data drug discovery with one-shot learning,” ACS Central Science, vol. 3, no. 4, pp. 283–293, 2017.
    • [59] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature Communications, vol. 8, p. 13890, 2017.
    • [60] J. S. Smith, O. Isayev, and A. E. Roitberg, “ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost,” Chemical Science, vol. 8, no. 4, pp. 3192–3203, 2017.
    • [61] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, p. 93, 2008.
    • [62] Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in ICML, 2016, pp. 40–48.
    • [63] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade.   Springer, 2012, pp. 639–655.
    • [64] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in ICML, 2003, pp. 912–919.
    • [65] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in ACM SIGKDD.   ACM, 2014, pp. 701–710.
    • [66] Q. Lu and L. Getoor, “Link-based classification,” in ICML, 2003, pp. 496–503.