Graph Neural Networks for Small Graph and Giant Network Representation Learning: An Overview

  • 2019-08-01 02:35:12
  • Jiawei Zhang
  • 30

Abstract

Graph neural networks denote a group of neural network models introduced forthe representation learning tasks on graph data specifically. Graph neuralnetworks have been demonstrated to be effective for capturing network structureinformation, and the learned representations can achieve the state-of-the-artperformance on node and graph classification tasks. Besides the differentapplication scenarios, the architectures of graph neural network models alsodepend on the studied graph types a lot. Graph data studied in research can begenerally categorized into two main types, i.e., small graphs vs. giantnetworks, which differ from each other a lot in the size, instance number andlabel annotation. Several different types of graph neural network models havebeen introduced for learning the representations from such different types ofgraphs already. In this paper, for these two different types of graph data, wewill introduce the graph neural networks introduced in recent years. To be morespecific, the graph neural networks introduced in this paper include IsoNN,SDBN, LF&ER, GCN, GAT, DifNN, GNL, GraphSage and seGEN. Among these graphneural network models, IsoNN, SDBN and LF&ER are initially proposed for smallgraphs and the remaining ones are initially proposed for giant networksinstead. The readers are also suggested to refer to these papers for detailedinformation when reading this tutorial paper.

 

Quick Read (beta)

Graph Neural Networks for Small Graph and Giant Network Representation Learning: An Overview

\nameJiawei Zhang \email[email protected]
\addrFounder and Director
Information Fusion and Mining Laboratory
(First Version: July 2019; Revision: July 2019.)
Abstract

Graph neural networks denote a group of neural network models introduced for the representation learning tasks on graph data specifically. Graph neural networks have been demonstrated to be effective for capturing network structure information, and the learned representations can achieve the state-of-the-art performance on node and graph classification tasks. Besides the different application scenarios, the architectures of graph neural network models also depend on the studied graph types a lot. Graph data studied in research can be generally categorized into two main types, i.e., small graphs vs. giant networks, which differ from each other a lot in the size, instance number and label annotation. Several different types of graph neural network models have been introduced for learning the representations from such different types of graphs already. In this paper, for these two different types of graph data, we will introduce the graph neural networks introduced in recent years. To be more specific, the graph neural networks introduced in this paper include IsoNN [4], SDBN [7], LF&ER [6], GCN [3], GAT [5], DifNN [8], GNL [1], GraphSage [2] and seGEN [9]. Among these graph neural network models, IsoNN, SDBN and LF&ER are initially proposed for small graphs and the remaining ones are initially proposed for giant networks instead. The readers are also suggested to refer to these papers for detailed information when reading this tutorial paper.

\usetikzlibrary

arrows \usetikzlibrarygraphs

Graph Neural Networks for Small Graph and Giant Network Representation Learning: An Overview Jiawei Zhang [email protected]
Founder and Director
Information Fusion and Mining Laboratory
(First Version: July 2019; Revision: July 2019.)

Keywords: Graph Neural Network; Representation Learning; Graph Mining; Deep Learning

Contents

1 Introduction

In the era of big data, graph provides a generalized representation of many different types of inter-connected data collected from various disciplines. Besides the unique attributes possessed by individual nodes, the extensive connections among the nodes can convey very complex yet important information. Graph data are very difficult to deal with because of their various shapes (e.g., small brain graphs vs. giant online social networks), complex structures (containing various kinds of nodes and extensive connections) and diverse attributes (attached to the nodes and links). Great challenges exist in handling the graph data with traditional machine learning algorithms directly, which usually take feature vectors as the input. Viewed in such a perspective, learning the feature vector representations of graph data will be an important problem.

The graph data studied in research can be generally categorized into two main types, i.e., small graphs vs. giant networks, which differ from each other a lot in the size, instance number and label annotation.

  • The small graphs we study are generally of a much smaller size, but with a large number of instances, and each graph instance is annotated with certain labels. The representative examples include the human brain graph, molecular graph, and real-estate community graph, whose nodes (usually only in hundreds) represent the brain regions, atoms and POIs, respectively.

  • On the contrary, giant networks in research usually involve a large number of nodes/links, but with only one single network instance, and individual nodes are labeled instead of the network. Examples of giant networks include social network (e.g., Facebook), eCommerce network (e.g., Amazon) and bibliographic network (e.g., DBLP), which all contain millions even billions of nodes.

Due to these property distinctions, the representation learning algorithms proposed for small graphs and giant networks are very different. To solve the small graph oriented problems, the existing graph neural networks focus on learning a representation of the whole graph (not the individual nodes) based on the graph structure and attribute information. Several different graph neural network models have been introduced already, including IsoNN (Isomorphic Neural Network) [4], SDBN (Structural Deep Brain Network) [7] and LF&ER (Deep Autoencoder based Latent Feature Extraction) [6]. These models are proposed for different small graph oriented application scenarios, covering both brain graphs and community POI graphs, which can also be applied to other application settings with minor extensions.

Meanwhile, for the giant network studies, in recent years, many research works propose to apply graph neural networks to learn their low-dimensional feature representations, where each node is represented as a feature vector. With these learned node representations, the graph neural network model can directly infer the potential labels of the nodes/links. To achieve such objectives, several different type of graph neural network models have been introduced, including GCN (Graph Convolutional Network) [3], GAT (Graph Attention Network) [5], DifNN (Deep Diffusive Neural Network) [8], GNL (Graph Neural Lasso), GraphSage (Graph Sample and Aggregate) [2] and seGEN (Sample and Ensemble Genetic Evolutionary Network) [9].

In this paper, we will introduce the aforementioned graph neural networks proposed for small graphs and giant networks, respectively. This tutorial paper will be updated accordingly as we observe the latest developments on this topic.

2 Graph Neural Networks for Small Graphs

In this section, we will introduce the graph neural networks proposed for the representation learning tasks on small graphs. Formally, we can represent the set of small graphs to be studied in this section as 𝒢=(G1,𝐲1),(G2,𝐲2),,(Gn,𝐲n), where Gi=(𝒱i,i) denotes a small graph instance and 𝐲idy denotes its label vector. Given a graph Gi𝒢, we can denote its network size as the number of involved nodes, i.e., |𝒱i|. Normally, the small graphs to be studied in set 𝒢 are of the same size. Meanwhile, depending on the application scenarios, the objective labels of the graph instances can be binary-class/multi-class vectors. The small graph oriented graph neural networks aim at learning a mapping, i.e., f:𝒢dh, to project the graph instances to their feature vector representations, which will be further utilized to infer their corresponding labels. Specifically, the graph neural network models to be introduced in this section include IsoNN [4], SDBN [7] and LF&ER [6]. The readers are also suggested to refer to these papers for detailed information when reading this tutorial paper.

2.1 IsoNN: Isomorphic Neural Network

Figure 1: IsoNN Model Architecture [4]. (The left plot provides the outline of IsoNN, including the graph isomorphic feature extraction and classification components. The right plot illustrates the detailed architecture of the proposed model, where the graph isomorphic features are extracted with the graph isomorphic layer and two min-pooling layers, and the graphs are further classified with three fully-connected layers.)

Graph isomorphic neural network (IsoNN) proposed in [4] recently aims at extracting meaningful sub-graph patterns from the input graph for representation learning. Sub-graph mining techniques have been demonstrated to be effective for feature extraction in the existing works. Instead of designing the sub-graph templates manually, IsoNN proposes to integrate the sub-graph based feature extraction approaches into the neural network framework for automatic feature representation learning. As illustrated in Figure 1, IsoNN includes two main components: graph Isomorphic feature extraction component and classification component. IsoNN can be in a deeper architecture by involving multiple graph isomorphic feature extraction components so that the model will learn more complex sub-graph patterns.

The graph isomorphic feature extraction component in IsoNN targets at the automatic sub-graph pattern learning and brain graph feature extraction with the following three layers:graph isomorphic layer, min-pooling layer 1 and min-pooling layer 2, which will be introduced as follows, respectively. Meanwhile, the classification component used in IsoNN involves several fully connected layers, which project the learned isomorphic features to the corresponding graph labels.

2.1.1 Graph Isomorphic Layer

In IsoNN, the sub-graph based feature extraction process is achieved by a novel graph isomorphic layer. Formally, given a brain graph G=(𝒱,), its adjacency matrix can be represented as 𝐀|𝒱|×|𝒱|. In order to find the existence of specific sub-graph patterns in the input graph, IsoNN matches the input graph with a set of sub-graph templates. Instead of defining these sub-graph templates manually as the existing works, each template is denoted as a kernel variable 𝐊ik×k,i{1,2,,c} and IsoNN will learn these kernel variables automatically. Here, k denotes the node number in the templates and c is the channel number. Meanwhile, to match one template (i.e., the kernel variable matrix 𝐊i) with regions in the input graph (i.e., sub-matrices in 𝐀), IsoNN uses a set of permutation matrices, which map both rows and columns of the kernel matrix to the sub-matrices in 𝐀 effectively. The permutation matrix can be represented as 𝐏{0,1}k×k that shares the same dimensions with the kernel matrix. Given a kernel variable matrix 𝐊i and a regional sub-matrix 𝐌(s,t)k×k in 𝐀 (where 𝐌(s,t)(1:k,1:k)=𝐀(s:s+k-1,t:t+k-1) and index pair s,t{1,2,,(|𝒱|-k+1)}), there may exist k! different such permutation matrices and the optimal one can be denoted as 𝐏*:

𝐏*=argmin𝐏𝒫𝐏𝐊i𝐏-𝐌(s,t)F2, (1)

where 𝒫={𝐏1,𝐏2,,𝐏k!} covers all the potential permutation matrices. The F-norm term measures the mapping loss, which is also used as the graph isomorphic feature in IsoNN. Formally, the isomorphic feature extracted based on the kernel 𝐊i for the regional sub-matrix 𝐌(s,t) in 𝐀 can be represented as

zi,(s,t) =𝐏*𝐊i(𝐏*)-𝐌(s,t)F2 (2)
=min{𝐏𝐊i𝐏-𝐌(s,t)F2}𝐏𝒫
=min(𝐳¯i,(s,t)(1:k!)),

where vector 𝐳¯i,(s,t)k! with 𝐳¯i,(s,t)(j)=𝐏j𝐊i𝐏j-𝐌(s,t)F2,j{1,2,,k!} denoting the features computed on the permutation matrix 𝐏j𝒫. Furthermore, by shifting the kernel matrix 𝐊i on regional sub-matrices in 𝐀, the isomorphic features extracted by IsoNN from the input graph can be denoted as a 3-way tensor 𝒵ik!×(|𝒱|-k+1)×(|𝒱|-k+1), where 𝒵i(1:k!,s,t)=𝐳¯i,(s,t)(1:k!).

2.1.2 Min-pooling Layers

Min-pooling Layer 1: As indicated by the Figure 1, IsoNN computes the final isomorphic features with the optimal permutation matrix for the kernel 𝐊i via two steps: (1) computing all the potential isomorphic features via different permutation matrices with the graph isomorphic layer, and (2) identifying the optimal features with the min-pooling layer 1 and layer 2. Formally, given the tensor 𝒵i computed by 𝐊i in the graph isomorphic layer, IsoNN will identify the optimal permutation matrices via the min-pooling layer 1. From tensor 𝒵i, the features computed with the optimal permutation matrices can be denoted as 𝐙i, where

𝐙i(s,t)=min{𝒵i(1:k!,s,t)},s,t{1,2,,(|𝒱|-k+1)}. (3)

The min-pooling layer 1 learns the optimal feature matrix 𝐙i for kernel 𝐊i along the first dimension of tensor 𝒵i, which are computed by the optimal permutation matrices. In a similar way, for the remaining kernels, their optimal graph isomorphic features can be obtained and denoted as matrices 𝐙1, 𝐙2, , 𝐙c, respectively.

Min-pooling Layer 2: For the same region in the input graph, different kernels can be applied to match the regional sub-matrix. Inspired by this, IsoNN incorporates the min-pooling layer 2, so that the model can find the best kernels that match the regions in 𝐀. With inputs 𝐙1,𝐙2,,𝐙c, the min-pooling layer 2 in IsoNN can identify the optimal features across all the kernels, which can be denoted as matrix 𝐐 with

𝐐(s,t)=min{𝐙1(s,t),𝐙2(s,t),,𝐙c(s,t)},s,t{1,2,,(|𝒱|-k+1)}. (4)

Entry 𝐐(s,t) denotes the graph isomorphic feature computed by the best sub-graph kernel on the regional matrix 𝐌(s,t) in 𝐀. Thus, via min-pooling layer 2, let 𝐐 be the final isomorphic feature matrix, which preserves the best sub-graph patterns contributing to the classification result. In addition, min-pooling layer 2 also effectively shrinks the feature length and greatly reduces the number of variables to be learned in the following classification component.

2.1.3 Classification Component

Given a brain graph instance Gg (𝒯 denotes the training batch), its extracted isomorphic feature matrix can be denoted as 𝐐g. By feeding its flat vectorized representation vector 𝐪g=vec(𝐐g) as the input into the classification component (with three fully-connected layers), the predicted label vector by IsoNN on the instance can be represented as 𝐲^g. Several frequently used loss functions, e.g., cross-entropy, can be used to measure the introduced loss between 𝐲^g and the ground-truth label vector 𝐲g. Formally, the fully-connected (FC) layers and the loss function used in IsoNN can be represented as follows:

FC Layers: {𝐝1=σ(𝐖1𝐪g+𝐛1),𝐝2=σ(𝐖2𝐝1+𝐛2),𝐲^g=softmax(𝐖3𝐝2+𝐛3); (5)

and

 Loss Function: (𝚯)=-gj𝐲g(j)log𝐲^g(j), (6)

where 𝐖i and 𝐛i are the weight and biase in ith layer, σ() denotes the sigmoid activation function and softmax() is the softmax function for output normalization. Variables 𝚯=({𝐊}i=1k,{𝐖i,𝐛i}i=13) (including the kernel matrices and weight/bias terms) involved in the model can be effectively learned with the error back propagation algorithm by minimizing the above loss function. For more information about IsoNN, the readers are suggested to refer to [4] for detailed descriptions.

2.2 SDBN: Structural Deep Brain Network

Structural Deep Brain Network (SDBN) initially proposed [7] applies the deep convolutional neural network to the brain graph mining problem, which can be viewed as an integration of convolutional neural network and autoencoder. Similar to IsoNN, SDBN also observes the order-less property with the brain graph data, and introduce a graph reordering approach to resolve the problem.

Figure 2: SDBN Model Architecture [7].

As illustrated in Figure 2, besides the necessary graph data processing and representation, SDBN involves three main steps to learn the final representations and labels fo the graph instances, i.e., graph reordering, structural augmentation and convolutional feature learning, which will be introduced as follows, respectively.

2.2.1 Graph Reordering

Given the graph set 𝒢={G1,G2,,Gn}, the goal of graph reordering is to find a node labeling n such that for any two graphs Gi,Gj𝒢 randomly draw from 𝒢, the expected differences between the distance of the graph connectivity adjacency matrices based on n and the distance of the graphs in the graph space is minimized. Formally, for each graph instance Gi𝒢, its connectivity adjacency matrix can be denoted as 𝐀i. Let d𝒜 and d𝒢 denote the distance metrics on the adjacency matrix domain 𝒜 and graph domain 𝒢 respectively, the graph reordering problem can be formulated as the following optimization problem:

argminn𝔼Gi,Gj𝒢(d𝒜(𝐀i,𝐀j)-d𝒢(Gi,Gj)). (7)

Graph reordering is a combinatorial optimization problem, which has also be demonstrated to be NP-hard and is computationally infeasible to address in polynomial time. SDBN proposes to apply the spectral clustering to help reorder the nodes and brain graph connectivity instead. Formally, based on the brain graph adjacency matrix 𝐀i of Gi, its corresponding Laplacian matrix can be represented as 𝐋i. The spectral clustering algorithm aims at partitioning the brain graph Gi into K modules, where the node-module belonging relationships are denoted by matrix 𝐅|𝒱|×K. The optimal 𝐅 can be effectively learned with the following objective function:

min𝐅tr(𝐅𝐋i𝐅) s.t.𝐅𝐅=𝐈, (8)

where 𝐈{0,1}K×K denotes an identity matrix and the constraint is added to ensure one node is assigned to one module only. From the learned optimal 𝐅, SDBN can assign the nodes in graph Gi to their modules ={M1,M2,,MK}, where 𝒱=i=1KMi and MiMj=,ij and i,j{1,2,,K}. Such learned modules can help reorder the nodes in the graph into relatively compact regions, and the graph connectivity adjacency matrix 𝐀i after reordering can be denoted as 𝐀~i. Similar operations will be performed on the other graph instances in the set 𝒢.

2.2.2 Structural Augmentation

Prior to feeding the reordered graph adjacency matrix to the deep convolutional neural network for representation learning, SDBN proposes to augment the network structure by both refining the connectivity adjacency matrix and creating an additional module identification channel.

  • Reordered Adjacency Matrix Refinement: Formally, for graph Gi𝒢, based on its reordered adjacency matrix 𝐀~i obtained from the previous step, SDBN proposes to refine its entry values with the following equation:

    𝐀~i(p,q)={1 for vp,vqMk,Mk;ϵ otherwise. (9)

    In the equation, term ϵ denotes a small constant.

  • Module Identification Channel Creation: From the reordered adjacency matrix 𝐀~i for graph Gi𝒢, the learned module identity information is actually not preserved. To effectively incorporate such information in the model, SDBN proposes to create one more channel 𝐌i|𝒱|×|𝒱| for graph Gi, whose entry values can be denoted as follows:

    𝐌i(p,q)={k for vp,vqMk,Mk;0 otherwise. (10)

Formally, based on the above operations, the inputs for the representation learning component on graph Gi will be 𝐗i=[𝐀~i;𝐌i], which encodes much more information and can help learn more useful representations.

2.2.3 Learning of the SDBN Model

Figure 3: SDBN architecture with three types of unsupervised learning augmentation [7].

As illustrated in Figure 3, based on the input matrix 𝐗 for the graphs in 𝒢 (here, the subscript of 𝐗 is not indicated and it can represent any graphs in 𝒢), SDBN proposes to apply the convolutional neural network for the graph representation learning. To be specific, the convolutional neural network used in SDBN involves two operators: conv and pool, which can be stacked together multiple times to form a deep architecture.

Formally, according to Figure 3, the intermediate representations of the input graphs as well as the corresponding labels in the SDBN can be represented with the following equations:

{𝐗(1)=pool(conv(𝐗;𝚯));𝐗(2)=pool(conv(𝐗(1);𝚯));𝐱=reshape(𝐗(2));𝐲^=FC(𝐱;𝚯), (11)

where reshape() flattens the matrix to a matrix and FC() denotes the fully-connected layers in the model. In the above equations, 𝚯 denotes the involved variables in the model, which will be optimized.

Based on the above model, for all the graph instances Gi𝒢, we can represent the introduced loss terms by the model as

(𝚯)=Gi𝒢(Gi;𝚯)=Gi𝒢(𝐲i,𝐲^i), (12)

where 𝐲i and 𝐲^i represent the ground-truth label vector and the inferred label vector of graph Gi, respectively.

Meanwhile, in addition to the above loss term, SDBN also incorporates the autoencoder into the model learning process via the depool and deconv operations. The conv and pool operators mentioned above compose the encoder part, whereas the deconv and depool operators will form the decoder part. Formally, based on the learned intermediate representation 𝐗(2) of the input graph matrix 𝐗, SDBN computes the recovered representations as follows:

𝐗^(1) =deconv(depool(𝐗(2));𝚯); (13)
𝐗^ =deconv(depool(𝐗^(1));𝚯).

By minimizing the difference between 𝐗^(1) and 𝐗(1), as well as the difference between 𝐗^ and 𝐗, i.e.,

(𝐗^(1),𝐗(1))=𝐗^(1)-𝐗(1)22; and (𝐗^,𝐗)=𝐗^-𝐗22 (14)

SDBN can effectively learn the involved variables in the model. As illustrated in Figure 3, the decoder step can work in different manner, which will lead to different regularization terms on the intermediate representations. The performance comparison between IsoNN and SDBN is also reported in [4], and the readers may refer to [4, 7] for more detailed information of the models and the experimental evaluation results.

2.3 LF&ER: Deep Autoencoder based Latent Feature Extraction

Deep Autoencoder based Latent Feature Extraction (LF&ER) initially proposed in [6] serves an a latent feature extraction component in the final model introduced in that paper. Based on the input community real-estate POI graphs, LF&ER aims to learn the latent representations of the POI graphs, which will be used to infer the community vibrancy scores.

Figure 4: The LF&ER Framework for Latent Feature Extraction [6].

2.3.1 Deep Autoencoder Model

The LF&ER model works based on the deep autoencoder actually. Autoencoder is an unsupervised neural network model, which projects the instances in original feature representations into a lower-dimensional feature space via a series of non-linear mappings. Figure 4 shows that autoencoder model involves two steps: encode and decode. The encode part projects the original feature vector to the objective feature space, while the decode step recovers the latent feature representation to a reconstruction space. In autoencoder model, we generally need to ensure that the original feature representation of instances should be as similar to the reconstructed feature representation as possible.

Formally, let 𝐱 represent the original feature representation of instance i, and 𝐲(1),𝐲(2),,𝐲(o) be the latent feature representations of the instance at hidden layers 1,2,,o in the encode step respectively, the encoding result in the objective lower-dimension feature space can be represented as 𝐳dz with dimension dz. Formally, the relationship between these vector variables can be represented with the following equations:

{𝐲(1)=σ(𝐖(1)𝐱+𝐛(1)),𝐲(k)=σ(𝐖(k)𝐲(k-1)+𝐛(k)),k{2,3,,o},𝐳=σ(𝐖(o+1)𝐲(o)+𝐛(o+1)). (15)

Meanwhile, in the decode step, the input will be the latent feature vector 𝐳 (i.e., the output of the encode step), and the final output will be the reconstructed vector 𝐱^. The latent feature vectors at each hidden layers can be represented as 𝐲^(o),𝐲^(o-1),,𝐲^(1). The relationship between these vector variables can be denoted as

{𝐲^(o)=σ(𝐖^(o+1)𝐳+𝐛^(o+1)),𝐲^(k-1)=σ(𝐖^(k)𝐲^(k)+𝐛^(k)),k{2,3,,o},𝐱^=σ(𝐖^(1)𝐲^(1)+𝐛^(1)). (16)

In the above equations, 𝐖 and 𝐛 with different subscripts denote the weight matrices and bias terms to be learned in the model. The objective of the autoencoder model is to minimize the loss between the original feature vector 𝐱 and the reconstructed feature vector 𝐱^. Formally, the loss term can be represented as

(𝚯)=(𝐱,𝐱^;𝚯)=𝐱-𝐱^22, (17)

where 𝚯 denotes the variables involved in the autoencoder model.

2.3.2 Latent Representation Learning

LF&ER proposes to learn the community allocation information for the vibrancy inference and ranking. Formally, spatial structure denotes the distribution of POIs inside the community, e.g., a grocery store lies between two residential buildings; a school is next to the police office. The Spatial structure can hardly be represented with explicit features extracted before, and LF&ER proposes to represent them with a set of latent feature vectors extracted from the geographic distance graph and the mobility connectivity graph defined in the previous subsection. The autoencoder model is applied here for the latent feature extraction.

Autoencoder model has been applied to embed the graph data into lower-dimensional spaces in many of the research works, which will obtain a latent feature representation for the nodes inside the graph. Different from these works, instead of calculating the latent feature for the POI categories inside the communities, LF&ER aims at obtaining the latent feature vector for the whole community, i.e., embedding the graph as one latent feature vector.

As shown in Figure 4, LF&ER transforms the matrix of the graphical distance graph (involving the POI categories) 𝐃 into a vector, which can be denoted as

𝐝=reshape(𝐃)|𝒱|2×1. (18)

Vector 𝐝 will be used as the input feeding into the autoencoder model. The latent embedding feature vector of 𝐝 can be represented as 𝐳D (i.e., the vector 𝐳 as introduced in the autoencoder model in the previous section), which depicts the layout information of POI categories in the community in terms of the geographical distance. Besides the static layout based on geographic distance graph, the spatial structure of the POIs in the communities can also be revealed indirectly through the human mobility. For a pair of POI categories which are far away geographically, if people like to go between them frequently, it can display another type of structure of the POIs in terms of their functional correlations. Via the multiple fully connected layers, LF&ER will project such learned features to the objective vibrancy scores of the community. We will not introduce the model learning part here, since it also involves the ranking models and explicit feature engineering works, which is not closely related to the topic of this paper. The readers may refer to [6] for detailed description about the model and its learning process. In addition, autoencoder (i.e., the base model of LF&ER) is also compared against IsoNN, whose results are reported in [4].

3 Graph Neural Networks for Giant Networks

In this section, we will introduce the graph neural networks proposed for the representation learning tasks on giant networks instead. Formally, we can represent the giant network instance to be studied in this section as G=(𝒱,), where 𝒱 and denote the sets of nodes and links in the network, respectively. Different from the small graph data studied in Section 2, the nodes in the giant network G are partially annotated with labels instead. Formally, we can represent the set of labeled nodes as 𝒱l={(vl,1,𝐲l,1),(vl,2,𝐲l,2),,(vl,m,𝐲l,m)}, where vl,i𝒱,i{1,2,,m} and 𝐲l,i denotes its label vector; whereas the remaining unlabeled nodes can be represented as 𝒱u=𝒱𝒱l. In the case where all the involved network nodes are labeled, we will have 𝒱l=𝒱 and 𝒱u=, which will be a special learning scenario of the general partial-labeled learning setting as studied in this paper. The giant network oriented graph neural networks aim at learning a mapping, i.e., f:𝒱dh, to obtain the feature vector representations of the nodes in the network, which can be utilized to infer their labels. To be more specific, the models to be introduced in this section include GCN [3], GAT [5], DifNN [8], GNL [1], GraphSage [2] and seGEN [9].

3.1 GCN: Graph Convolutional Network

Graph convolutional network (GCN) initially proposed in [3] introduces a spectral graph convolution operator for the graph data representation learning, which also provides several different approximations of the operator to encode both the graph structure and features of the nodes. GCN works well for the partially labeled giant networks, and the learned node representations can be effectively applied for the node classification task.

3.1.1 Spectral Graph Convolution

Formally, given an input network G=(𝒱,), its network structure information can be denoted as an adjacency matrix 𝐀={0,1}|𝒱|×|𝒱|. The corresponding normalized graph Laplacian matrix can be denoted as 𝐋=𝐈-𝐃-12𝐀𝐃-12=𝐔𝚲𝐔, where 𝐃 is a diagonal matrix with entries 𝐃(i,i)=j𝐀(i,j) on its diagonal and 𝐈=diag(𝟏){0,1}|𝒱|×|𝒱| is an identity matrix with ones on its diagonal. The eigen-decomposition of matrix 𝐋 can be denoted as 𝐋=𝐔𝚲𝐔, where 𝐔 denotes the eigen-vector matrix and diagonal matrix 𝚲 has eigen-values on its diagonal.

The spectral convolution operator defined on network G in GCN is denoted as a multiplication of an input signal vector 𝐱dx with a filter 𝐠𝜽=diag(𝜽) (parameterized by variable vector 𝜽dθ) in the Fourier domain as follows:

𝐠𝜽*𝐱=𝐔𝐠𝜽𝐔𝐱, (19)

where notation 𝐔𝐱 is defined as the graph Fourier transform of 𝐱 and 𝐠𝜽 can be understood as a function on the eigen-values, i.e., 𝐠𝜽(𝚲).

According to Equ. (19), the computation cost of the term on the right-hand-side will be 𝒪(|𝒱|2). For the giant networks involving millions even billions of nodes, the computation of the graph convolution term will become infeasible, not to mention the eigen-decomposition of the Laplacian matrix 𝐋 defined before. Therefore, to resolve such a problem, [3] introduces an approximation of the filter function 𝐠𝜽(𝚲) by a truncated expansion in terms of the Chebyshev polynomial Tk() up to the Kth order as follows:

𝐠𝜽*𝐱k=0K𝜽(k)Tk(𝐋~)𝐱, (20)

where 𝐋~=2λmax𝐋-𝐈 and λmax is the largest eigen-value in matrix 𝚲. Vector 𝜽k is a vector of Chebyshev coefficients. Noticing that the computational complexity of the term on the right-hand-side is 𝒪(||), i.e., linear in terms of the edge numbers, which will be lower than that of Equ. (19) introduced before.

Figure 5: Schematic depiction of multi-layer GCN for semisupervised learning with C input channels and F feature maps in the output layer [3].

3.1.2 Graph Convolution Approximation

As proposed in [3], Equ. (20) can be further simplified by setting K=1, λmax=2, 𝜽(0)=-𝜽(1)=θ, which will reduce the right hand-term of Equ. (20) approximately as follows:

𝐠𝜽*𝐱𝜽(0)𝐱+𝜽(1)𝐋~𝐱 =θ(𝐈+𝐃-12𝐀𝐃-12)𝐱 (21)
=θ(𝐃~-12𝐀~𝐃~-12)𝐱,

where 𝐀~=𝐀+𝐈 and 𝐃~ is the diagonal matrix defined on 𝐀~ instead.

As illustrated in Figure 5, in the case when there exist C input channels, i.e., the input will be a matrix 𝐗|𝒱|×C, and F different filters defined above, the learned graph convolution feature representations will be

𝐙 =(𝐃~-12𝐀~𝐃~-12)𝐗𝐖, (22)
=𝐀^𝐗𝐖.

where matrix 𝐀^=𝐃~-12𝐀~𝐃~-12 can be pre-computed in advance. Matrix 𝐖C×F is the filter parameter matrix and 𝐙|𝒱|×F will be the learned convolved representations of all the nodes. The computational time complexity of the operation will be 𝒪(||FC).

3.1.3 Deep Graph Convolutional Network Learning

The GCN model can have a deeper architecture by involving multiple graph convolution operators defined in the previous sections. For instance, the GCN model with two layers can be represented with the following equations:

{Layer 1: 𝐙=ReLU(𝐀^𝐗𝐖1);Output Layer: 𝐘^=softmax(𝐀^𝐙𝐖2).𝐘^=softmax(𝐀^ReLU(𝐀^𝐗𝐖1)𝐖2). (23)

In the above equation, matrices 𝐖1 and 𝐖2 are the involved variables in the model. ReLU is used as the activation function for the hidden layer 1, and softmax function is used for the output result normalization. By comparing the inferred labels, i.e., 𝐘^, of the labeled instances against their ground-truth labels, i.e., 𝐘, the model variables can be effectively learned by minimizing the following loss function:

(𝚯)=-vi𝒱Lj=1dy𝐘(i,j)log𝐘^(i,j), (24)

where 𝚯 covers all the variables in the model.

For representation simplicity, node subscript is used as its corresponding index in the label matrix 𝐘. Notation dy denotes the number of labels in the studied problem, and dy=2 for the traditional binary classification tasks. The readers can also refer to [3] for detailed information of the GCN model.

3.2 GAT: Graph Attention Network

Figure 6: Schematic depiction of GAT with multi-head attention for the node representation update [5].

Graph attention network (GAT) initially proposed in [5] can be viewed as an extension of GCN. In updating the nodes’ representations, instead of assigning the neighbors with fixed weights, i.e., values in matrix 𝐀^ in Equ. (22) and Equ. (23), GAT introduces an attention mechanism to compute the weights based on the representations of the target node as well as its neighbors.

3.2.1 Graph Attention Coefficient Computation

Formally, given an input network G=(𝒱,) and the raw features of the nodes, the node features can be denoted as a matrix 𝐗|𝒱|×dx, where dx denotes the dimension of the node feature vectors. Furthermore, for node vi𝒱, its feature vector can also be represented as 𝐱i=𝐗(i,:) for simplicity. Without considerations about the network structures, via a mapping matrix 𝐖dx×dh, the nodes can be projected to their representations in the hidden layer. Meanwhile, to further incorporate the network structure information into the model, based on the network structure, the neighbor set of node vi can be denoted as Γ(vi)={vj|vj𝒱(vi,vj)}{vi}, where {vi} is also added and treated as the self-neighbor. As illustrated in Figure 6, GAT proposes to compute the attention coefficient between nodes vi and vj (if vjΓ(vi)) as follows:

ei,j=LeakyReLU(𝐚(𝐖𝐱i𝐖𝐱j)), (25)

where 𝐚2dh is a variable vector for weighted sum of the entries in vector 𝐖𝐱i𝐖𝐱j and denotes the concatenation operator of two vectors. LeakyReLU function is added here mainly for the model learning considerations.

To further normalize the coefficients among all the neighbors, GAT further adopts the softmax function based on the coefficients defined above. Formally, the final computed weight between nodes vi and vj can be denoted as

αi,j =softmax(ei,j) (26)
=exp(ei,j)vkΓ(vi)exp(ei,k)
=exp(LeakyReLU(𝐚(𝐖𝐱i𝐖𝐱j)))vkΓ(vi)exp(LeakyReLU(𝐚(𝐖𝐱i𝐖𝐱k))).

3.2.2 Representation Update via Neighborhood Aggregation

GAT effectively update the nodes’ representations by aggregating the information from their neighbors (including the self-neighbor). Formally, the learned hidden representation of node vi can be represented as

𝐡i=σ(vjΓ(vi)αi,j𝐖𝐱j). (27)

GAT can be in a deeper architecture by involving multiple attentive node representation updating. In the deep architecture, for the upper layers, the representation vector 𝐡i will be treated as the inputs feature vector instead, and we will not over-elaborate that here.

3.2.3 Multi-Head Attention Aggregation

As introduced in [5], to stabilize the learning process of the model, GAT can be further extended to include the multi-head attention as illustrated in Figure 6. Specifically, let K denote the number of involved attention mechanisms. Based on each attention mechanism, the learned representations of node vi based on the above descriptions (i.e., Equ. (27)) can be denoted as 𝐡i(1), 𝐡i(2), , 𝐡i(K), respectively. By further aggregating such learned representations together, the ultimate learned representation of node vi can be denoted as

𝐡i=Aggregate(𝐡i(1),𝐡i(2),,𝐡i(K)). (28)

Several different aggregation function is tried in [5], including concatenation and average:

  • Concatenation:

    𝐡i=k=1Kσ(vjΓ(vi)αi,j(k)𝐖(k)𝐱j). (29)
  • Average:

    𝐡i=σ(1Kk=1KvjΓ(vi)αi,j(k)𝐖(k)𝐱j). (30)

The learning process of the GAT model is very similar to that of GCN introduced in Section 3.1.3, and we will not introduce that part again here. The readers can also refer to [5] for more detailed descriptions about the model as well as its experimental performance.

3.3 DifNN: Deep Diffusive Neural Network

Deep diffusive network (DifNN) model initially introduced in [8] aims at modeling the diverse connections in heterogeneous information networks, which contains multiple types of nodes and links. DifNN is based on a new type of neuron, namely gated diffusive unit (GDU), which can be extended to incorporate the inputs from various groups of neighbors.

3.3.1 Model Architecture

Figure 7: An example of the news augmented heterogeneous social network
Figure 8: The DifNN model architecture build for the input heterogeneous social network

Given a heterogeneous input network G=(𝒱,), the node set 𝒱 in the network can be divided into multiple subsets depending on their node types. It is similar for the links in set . Here, for the representation simplicity, we will follow the news augmented heterogeneous social network example illustrated in [8] when introducing the model. As illustrated in Figure 8, there exist three different types of nodes (i.e., creator, news article and subject) and two different types of links (i.e., the creator-article link and article-subject link) in the network. Formally, the node set can be categories into three subsets, i.e., 𝒱=𝒰𝒩𝒮, and the link set can be categorized into two subsets, i.e., =u,nn,s.

For each node in the network, e.g., vi𝒱, its extracted raw feature vector can be denoted as 𝐱i. As introduced at the beginning of Section 3, in many cases, the network is partially labeled. Formally, the label vector of node vi is represented as 𝐲i. For each nodes in the input network, DifNN utilizes one GDU (which will be introduced in the following subsection) to model its representations and the connections with other neighboring nodes. For instance, based on the input network in Figure 8, its corresponding DifNN model architecture can be represented in Figure 8. Via the gdu neuron unit, DifNN can effectively project the node inputs to their corresponding labels. The parameters involved in the DifNN model can be effectively trained based on the labeled nodes via the back propagation algorithm. In the following two subsections, we will introduce the detailed information about GDU as well as the DifNN model training.

3.3.2 Gated Diffusive Unit

Figure 9: An illustration of the gated diffusive unit (GDU).

To introduce the GDU neuron, we can take news article nodes as an example here. Formally, among all the inputs of the GDU model, 𝐱i denotes the extracted feature vector for news articles, 𝐳i represents the input from other GDUs corresponding to subjects, and 𝐭i represents the input from other GDUs about creators. Considering that the GDU for each news article may be connected to multiple GDUs of subjects and creators, the mean() of the outputs from the GDUs corresponding to these subjects and creators will be computed as the inputs 𝐳i and 𝐭i instead respectively, which is also indicated by the GDU architecture illustrated in Figure 9. For the inputs from the subjects, GDU has a gate called the “forget gate”, which may update some content of 𝐳i to forget. The forget gate is important, since in the real world, different news articles may focus on different aspects about the subjects and “forgetting” part of the input from the subjects is necessary in modeling. Formally, the “forget gate” together with the updated input can be represented as

𝐳~i=𝐟i𝐳i, where 𝐟i=σ(𝐖f[𝐱i,𝐳i,𝐭i]). (31)

Here, operator denotes the entry-wise product of vectors and 𝐖f represents the variable of the forget gate in GDU.

Meanwhile, for the input from the creator nodes, a new node-type “adjust gate” is introduced in GDU. Here, the term “adjust” models the necessary changes of information between different node categories (e.g., from creators to articles). Formally, the “adjust gate” as well as the updated input can be denoted as

𝐭~i=𝐞i𝐭i, where 𝐞i=σ(𝐖e[𝐱i,𝐳i,𝐭i]), (32)

where 𝐖e denotes the variable matrix in the adjust gate.

GDU allows different combinations of these input/state vectors, which are controlled by the selection gates 𝐠i and 𝐫i respectively. Formally, the final output of GDU will be

𝐡i =𝐠i𝐫itanh(𝐖u[𝐱i,𝐳~i,𝐭~i]) (33)
+(𝟏-𝐠i)𝐫itanh(𝐖u[𝐱i,𝐳i,𝐭~i])
+𝐠i(𝟏-𝐫i)tanh(𝐖u[𝐱i,𝐳~i,𝐭i])
+(𝟏-𝐠i)(𝟏-𝐫i)tanh(𝐖u[𝐱i,𝐳i,𝐭i]),

where 𝐠i=σ(𝐖g[𝐱i,𝐳i,𝐭i]), and 𝐫i=σ(𝐖r[𝐱i,𝐳i,𝐭i]), and term 𝟏 denotes a vector filled with value 1. Operators and denote the entry-wise addition and minus operation of vectors. Matrices 𝐖u, 𝐖g, 𝐖r represent the variables involved in the components. Vector 𝐡i will be the output of the GDU model.

The introduced GDU model also works for both the news subjects and creator nodes in the network. When applying the GDU to model the states of the subject/creator nodes with two input only, the remaining input port can be assigned with a default value (usually vector 𝟎). In the following section, we will introduce how to learn the parameters involved in the DifNN model for concurrent inference of multiple nodes.

3.3.3 DifNN Model Learning

In the DifNN model as shown in Figure 8, based on the output state vectors of news articles, news creators and news subjects, the framework will project the feature vectors to their labels. Formally, given the state vectors 𝐡n,i of news article ni, 𝐡u,j of news creator uj, and 𝐡s,l of news subject sl, their inferred labels can be denoted as vectors 𝐲n,i,𝐲u,j,𝐲s,l|𝒴| respectively, which can be represented as

{𝐲n,i=softmax(𝐖n𝐡n,i),𝐲u,j=softmax(𝐖u𝐡u,j),𝐲s,l=softmax(𝐖s𝐡s,l). (34)

where 𝐖u, 𝐖n and 𝐖s define the weight variables projecting state vectors to the output vectors.

Meanwhile, based on the news articles in the training set 𝒯n𝒩 with the ground-truth label vectors {𝐲^n,i}ni𝒯n, the loss function of the framework for news article label learning are defined as the cross-entropy between the prediction results and the ground truth:

(𝒯n;𝚯) =-ni𝒯nk=1|𝒴|𝐲^n,i(k)log𝐲n,i(k). (35)

Similarly, the loss terms introduced by news creators and subjects based on training sets 𝒯u𝒰 and 𝒯s𝒮 can be denoted as

(𝒯u;𝚯) =-uj𝒯uk=1|𝒴|𝐲^u,j(k)log𝐲u,j(k), (36)
(𝒯s;𝚯) =-sl𝒯sk=1|𝒴|𝐲^s,l(k)log𝐲s,l(k), (37)

where 𝐲u,j and 𝐲^u,j (and 𝐲s,l and 𝐲^s,l) denote the prediction result vector and ground-truth vector of creator (and subject) respectively.

Formally, the main objective function of the DifNN model can be represented as follows:

min𝚯(𝒯n;𝚯)+(𝒯u;𝚯)+(𝒯s;𝚯)+αreg(𝚯), (38)

where 𝚯 denotes all the involved variables to be learned, term reg(𝚯) represents the regularization term (i.e., the sum of L2 norm on the variable vectors and matrices), and α denotes the regularization term weight. By resolving the optimization functions, variables in the model can be effectively learned with the back-propagation algorithm. For the news articles, creators and subjects in the testing set, their predicted labels will be outputted as the final result.

3.4 GNL: Graph Neural Lasso

Graph neural lasso (GNL) initially proposed in [1] is a graph neural regression model and it can effectively incorporate the historical time-series data of multiple instances for addressing the dynamic network regression problem. GNL extends the GDU neuron [8] (also introduced in Section 3.3.2) for incorporating both the network internal relationships and the network dynamic relationships between sequential network snapshots.

3.4.1 Dynamic Gated Diffusive Unit

GNL also adopts GDU as the basic neuron unit and extends it to the dynamic network regression problem settings, which can model both the network snapshot internal connections and the temporal dependency relationships between sequential network snapshots for the nodes.

Figure 10: The detailed architecture of the GDU neuron of node vi at timestamp τ.
Figure 11: The framework outline of GNL based on GDU.

Formally, given the time series data about a set of connected entities, such data can be represented as a dynamic network set 𝒢={G(1),G(2),,G(t)}, where t denotes the maximum timestamp. For each network G(τ)𝒢, it can be denoted as G(τ)=(𝒱(τ),(τ)) involving the node set 𝒱(τ) and link set (τ), respectively. Given a node vi in network G(τ), its in-neighbors and out-neighbors in the network can be denoted as sets Γin(vi)={vj|vj𝒱(τ)(vj,vi)(τ)} and Γout(vi)={vj|vj𝒱(τ)(vi,vj)(τ)}. Here, the link direction denotes the influences among the nodes. If the influences in the studied networks are bi-directional, the in/out neighbor sets will be identical, i.e., Γin(vi)=Γout(vi).

For node vi in network G(τ) of the τth timestamp, the input attribute values of vi can be denoted as an input feature vector 𝐱i(τ)dx. GDU maintains a hidden state vector for each node, and the vector of node vi at timestamp τ can be denoted as 𝐡i(τ)dh. As illustrated in Figure 11, besides the feature vector 𝐱i(τ) and hidden state vector 𝐡i(τ) inputs, the GDU neuron of vi will also accept the inputs from vi’s input neighbor nodes, i.e., {𝐡j(τ)}vjΓin(vi), which will be integrated via certain aggregation operators:

𝐳i(τ)=Aggregate({𝐡j(τ)}vjΓin(vi)). (39)

The Aggregate() operator used in GNL will be introduced in detail in the next subsection.

A common problem with existing graph neural network models is over-smoothing, which will reduce all the nodes in the network to similar hidden representations. Such a problem will be much more serious when the model involves a deep architecture with multiple layers. To resolve such a problem, besides the attention mechanism to be introduced later, GDU introduces several gates for the neural state adjustment as introduced in Section 3.3.2. Formally, based on the input vectors 𝐱i(τ), 𝐳i(τ) and 𝐡i(τ), the representation of node vi in the next timestamp τ+1 can be represented as

𝐡i(τ+1)=GDU(𝐱i(τ),𝐳i(τ),𝐡i(τ);𝚯). (40)

The concrete representation of the GDU() function is similar to Equ. (31)-(33) introduced before, and 𝚯 denotes the variables involved in the GDU neuron.

3.4.2 Attentive Neighborhood Influence Aggregation

In this part, we will introduce the Aggregate() operator used in Equ. (39) for node neighborhood influence integration proposed in [1]. The GNL model defines such an operator based on an attention mechanism. Formally, given the node vi and its in-neighbor set Γin(vi), for any node vjΓin(vi), GNL quantifies the influence coefficient of vj on vi based on their hidden state vectors 𝐡j(τ) and 𝐡i(τ) as follows:

αj,i(τ)=AttInf(ej,i(τ))=exp(ej,i(τ))vkΓout(vj)exp(ej,k(τ)), where ej,i(τ)=Linear(𝐖a𝐡j(τ)𝐖a𝐡i(τ);𝐰a). (41)

In the above equation, operator Linear(;𝐰a) denotes a linear sum of the input vector parameterized by weight vector 𝐰a. According to [5], out of the model learning concerns, the above influence coefficient term can be slightly changed by adding the LeakyReLU function into its definition. Formally, the final influence coefficient used in GNL is represented as follows:

αj,i(τ)=AttInf(𝐡j(τ),𝐡i(τ);𝐖a,𝐰a)=exp(LeakyReLU(Linear(𝐖a𝐡j(τ)𝐖a𝐡i(τ);𝐰a)))vkΓout(vj)exp(LeakyReLU(Linear(𝐖a𝐡j(τ)𝐖a𝐡k(τ);𝐰a))). (42)

Considering that in our problem setting the links in the dynamic networks are unknown and to be inferred, the above influence coefficient term αj,i actually quantifies the existence probability of the influence link (vj,vi), i.e., the inference results of the links. Furthermore, based on the influence coefficient, the concrete representation of Equ. (39) will be represented as follows:

𝐳i(τ)=Aggregate({𝐡j(τ)}vjΓin(vi))=σ(vjΓin(vi)αj,i(τ)𝐖a𝐡j(τ)). (43)

3.4.3 Graph Neural Lasso Model Learning

In this part, we will introduce the architecture of the GNL model together with its learning settings. Formally, given the input dynamic network set 𝒢={G(1),G(2),,G(t)}, GNL shifts a window of size τ along the networks in the order of their timestamps. The network snapshots covered by the window, e.g., G(k), G(k+1), , G(k+τ-1), will be taken as the model input of GNL to infer the network G(k+τ) in following timestamp (where k,k+1,,k+τ{1,2,,t}). According to the above descriptions, the inferred attribute values of all the nodes and their potential influence links in network G(k+τ) can be represented as

𝐱^i(τ+1)=FC(𝐡i(τ+1);𝚯),vi𝒱(τ+1); (44)
αj,i(τ)=AttInf(𝐡j(τ+1),𝐡i(τ+1);𝚯),vi,vj𝒱(τ+1),

In the above equation, term 𝐡i(τ+1) is defined in Equ. (40) and 𝚯 covers all the involved variables used in the GNL model. By comparing the inferred node attribute values, e.g., 𝐱^i(τ+1), against the ground truth values, e.g., 𝐱i(τ+1), the quality of the inference results by GNL can be effectively measured with some loss functions, e.g., mean square error:

(𝚯)=1|𝒱(τ+1)|vi𝒱(τ+1)(vi;𝚯)=1|𝒱(τ+1)|vi𝒱(τ+1)𝐱^i(τ+1)-𝐱i(τ+1)22. (45)

In addition, similar to Lasso, to avoid overfitting, GNL proposes to add a regularization term in the objective function to maintain the sparsity of the variables. Formally, the final objective function of the GNL model can be represented as follows:

min𝚯(𝚯)+β𝚯1, (46)

where term 𝚯1 denotes the sum of the L1-norm regularizer of all the involved variables in the model and β is the hyper-parameter weight of the regularization term. More information about the model learning as well as how to handle the non-derivable L1-norm regularization term is available in [1].

3.5 GraphSage: Graph Sample and Aggregate

GraphSage introduced in [2] is an inductive model which focuses on leveraging node feature information for effective network node embedding. Instead of training individual embedding for each node, GraphSage learns a function that generate embeddings by sampling and aggregating features from nearby neighbors.

3.5.1 Framework Descriptions

{algorithm}

[t] Algorithm GraphSage \[email protected]@algorithmic[1] \REQUIRENetwork G=(𝒱,); Input Node Feature: {𝐱i}vi𝒱; Model Depth: K. \ENSURELearned Representations {𝐳i}vi𝒱. \STATEInitialize 𝐡i(0)=𝐱i,vi𝒱 \FORk{1,2,,K} \FORvi𝒱 \STATE𝒩(vi)=Sample(Γ(vi)) \STATE𝐡Γ(vi)(k)=Aggregate({𝐡j(k-1)|vj𝒩(vi)}) \STATE𝐡i(k)=σ(𝐖(k)(𝐡Γ(vi)(k)𝐡i(k-1))) \ENDFOR\STATE𝐡i(k)=Normalize(𝐡i(k)) \ENDFOR\STATE𝐳i=𝐡i(K),vi𝒱

As illustrated in Algorithm 3.5.1, the GraphSage algorithm accepts the network structure G=(𝒱,), input raw feature vectors {𝐱i}vi𝒱 and model depth K as the inputs, which will return the learned representations of nodes in the network as the output. Several functions will be called in the algorithm, including Sample(), Aggregate() and Normalize().

The forward computational process of GraphSage works iteratively layers by layers and nodes by nodes. Formally, the representations of nodes at each layer can be represented as vectors {𝐡i(k)}vi𝒱,k{1,2,,K}, where 𝐡i(k) denotes the representation of vi at the kth layer. For all the nodes, their representations at layer 0 are initialized with their raw features, i.e., 𝐡i(0)=𝐱i,vi𝒱. Given a node vi, its neighbors can be represented as set Γ(vi)={vj|vj𝒱(vi,vj)}. GraphSage calls an aggregation function to effectively aggregate the neighbors’ representations. However, instead of directly working on the complete neighbor set Γ(vi), GraphSage proposes to sample a subset of the neighbors prior to information aggregation, which is denoted as

𝒩(vi)=Sample(Γ(vi)) (47)

where 𝒩(vi)Γ(vi) has a fixed size for all the nodes in the network and the sampling process follows a uniform distribution.

At the kth layer, via aggregating all the representations of the nodes in 𝒩(vi) from the (k-1)th layer, GraphSage defines the a pseudo-representation for vi as follows:

𝐡Γ(vi)(k)=Aggregate({𝐡j(k-1)|vj𝒩(vi)}). (48)

The concrete representations of the Aggregate() function will be introduced later.

By concatenating the computed pseudo-representation 𝐡Γ(vi)(k) and its representation at the (k-1)th layer, GraphSage defines the node representation updating equation as follows:

𝐡i(k) =σ(𝐖(k)(𝐡Γ(vi)(k)𝐡i(k-1))), (49)
𝐡i(k) =Normalize(𝐡i(k))=𝐡i(k)𝐡i(k)2, (50)

where operator denotes the concatenation of two vectors and GraphSage normalizes the vector 𝐡i(k) by dividing it with its modulus.

3.5.2 Aggregator Function

The “orderless” property of the the neighbor nodes poses more challenges on the aggregation operator to be used in GraphSage. Besides the aggregate function used above, several other aggregators can also be used for the information integration, which are provided as follows:

  • Mean Aggregator: The mean aggregator is very similar to the propagation rules used in GCN introduced in Section 3.1, which replaces Equ. (48) and Equ. (49) with the following equation instead:

    𝐡i(k)=σ(𝐖(k)Mean({𝐡i(k-1)}{𝐡j(k-1)|vj𝒩(vi)})). (51)
  • LSTM Aggregator: Much more complex aggregators, e.g., LSTM, can also be adopted for the nodes representation aggregation and updating in GraphSage. By randomly permuting the neighbors in set Γ(vi), LSTM can be applied on the unordered set, where the output of the last unit can be obtained as the output. In other words, Equ. (48) can be updated as follows:

    𝐡Γ(vi)(k)=LSTM({𝐡j(k-1)|vj𝒩(vi)}) (52)
  • Pooling Aggregator: The pooling aggregator works with a max pooling layer to integrate the information from the neighbors, and Equ. (48) can be replaced as follows:

    𝐡Γ(vi)(k)=max({σ(𝐖(k)𝐡j(k-1)+𝐛(k))|vj𝒩(vi)}) (53)

3.6 seGEN: Sample and Ensemble Genetic Evolutionary Network

Figure 12: The seGEN Framework [9] with Three Main Steps (Step 1: graph sampling to achieve a set of sub-graph instances; Step 2: sub-graph representation learning to get the representation features of nodes; Step 3: result ensemble of the learned node representations on each sub-graph by the unit models to get the final node representation).

Sample-Ensemble Genetic Evolutionary Network (seGEN) first proposed in [9] can serve as an alternative approach to GNN models on giant networks. Instead of building one single graph neural network, based on a set of sampled sub- graphs, seGEN adopts a genetic-evolutionary learning strategy to build a group of unit models generations by generations. The unit models incorporated in seGEN can be either traditional graph representation learning models or the recent graph neural network models with a much “narrower” and “shallower” architecture. The learning results of each instance at the final generation will be effectively combined from each unit model via diffusive propagation and ensemble learning strategies.

3.6.1 seGEN Architecture Description

In framework seGEN, instead of handling the input large-scale graph data directly, it proposes to sample a subset (of set size s) of small-sized sub-graphs (of a pre-specified sub-graph size k) instead and learn the representation feature vectors of nodes based on the sub-graphs. To ensure the learned representations can effectively represent the characteristics of nodes, seGEN need to ensure the sampled sub-graphs share similar properties as the original large-sized input graph. As shown in Figure 12, five different types of graph sampling strategies (indicated in five different colors) are adopted, and each strategy will lead to a group of small-sized sub-graphs, which can capture both the local and global structures of the original graph. According to [9], the sampled sub-graphs based on different sampling strategies, e.g., BFS, DFS, BNS, BES and HS, can be represented as 𝒢bfs, 𝒢dfs, 𝒢hs, 𝒢ns or 𝒢es, respectively.

Instead of fitting each unit model with all the sub-graphs in the pool 𝒢, in the unit model, a set of sub-graph training batches 𝒯1,𝒯2,,𝒯m will be sampled for each unit model respectively in the learning process, where |𝒯i|=b,i{1,2,,m} are of the pre-defined batch size b. These batches may share common sub-graphs as well, i.e., 𝒯i𝒯j may not necessary be . In the model, the unit models learning process for each generation involves two steps: (1) generating the batches 𝒯i from the pool set 𝒢 for each unit model Mi11, and (2) learning the variables of the unit model Mi1 based on sub-graphs in batch 𝒯i. Considering that the unit models have a much smaller number of hidden layers, the learning time cost of each unit model will be much less than the deeper models on larger-sized graphs.

In the following parts, we will first introduce the learning process of the model, which accepts each sub-graph pool as the input and learns the representation feature vectors of nodes as the output. We can use 𝒢 to represent the sampled pool set, which can be 𝒢bfs, 𝒢dfs, 𝒢hs, 𝒢ns or 𝒢es respectively. The learned results can be further fused together with a hierarchical ensemble process to be introduced at the last subsection.

3.6.2 Genetic Evolutionary Learning of seGEN

The training process of seGEN involves several key steps, including unit model evaluation and selection, crossover and mutation, which will be introduced as follows:

  • Evaluation and Selection: The unit models in the generation set 1 can have different performance, due to (1) different initial variable values, and (2) different training batches in the learning process. In framework seGEN, instead of applying “deep” models with multiple hidden layers, it proposes to “deepen” the models in another way: “evolve the unit model into ‘deeper’ generations”. For each unit model Mk11, based on the sub-graphs in a validation set 𝒱, seGEN measures the introduced loss of the model as

    (Mk1;𝒱)=g𝒱vi,vj𝒱g,vivjsi,j𝐳k,i1-𝐳k,j122,

    where 𝐳k,i1 and 𝐳k,j1 denote the learned latent representation feature vectors of nodes vi,vj in the sampled sub-graph g. Term si,j has value +1 if vi and vj are connected in subgraph g, otherwise si,j will be assigned with value 0 instead.

    The probability for each unit model to be picked as the parent model for the crossover and mutation operations can be represented as

    p(Mk1)=exp-(Mk1;𝒱)Mi11exp-(Mi1;𝒱).

    In the real-world applications, a normalization of the loss terms among these unit models is necessary. For the unit model introducing a smaller loss, it will have a larger chance to be selected as the parent unit model. Considering that the crossover is usually done based a pair of parent models, the pairs of parent models selected from set 1 can be denoted as 𝒫1={(Mi1,Mj1)k}k{1,2,,m}, based on which seGEN will be able to generate the next generation of unit models, i.e., 2.

  • Crossover: For the kth pair of parent unit model (Mi1,Mj1)k𝒫1, their genes can be denoted as their variables θi1,θj1 respectively (since the differences among the unit models mainly lie in their variables), which are actually their chromosomes for crossover and mutation.

    seGEN proposes to adopt the uniform crossover to get the chromosomes (i.e., the variables) of their child model. Considering that the parent models Mi1 and Mj1 can actually achieve different performance on the validation set 𝒱, in the crossover, the unit model achieving better performance should have a larger chance to pass its chromosomes to the child model.

    Formally, the chromosome inheritance probability for parent model Mi1 can be represented as

    p(Mi1)=exp-(Mi1;𝒱)exp-(Mi1;𝒱)+exp-(Mj1;𝒱)

    Meanwhile, the chromosome inheritance probability for model Mj1 can be denoted as p(Mj1)=1-p(Mi1).

    In the uniform crossover method, based on parent model pair (Mi1,Mj1)k𝒫1, the obtained child model chromosome vector can be denoted as θk2|θ1| (the superscript denotes the 2nd generation and |θ1| denotes the variable length), which is generated from the chromosome vectors θi1 and θj1 of the parent models. Meanwhile, the crossover choice at each position of the chromosomes vector can be represented as a vector 𝐜{i,j}|θ1|. The entries in vector 𝐜 are randomly selected from values in {i,j} with a probability p(Mi1) to pick value i and a probability p(Mj1) to pick value j respectively. The lth entry of vector θk2 before mutation can be represented as

    θ^k2(l)=𝟙(c(l)=i)θi1(l)+𝟙(c(l)=j)θj1(l),

    where indicator function 𝟙() returns value 1 if the condition is True; otherwise, it returns value 0.

  • Mutation: The variables in the chromosome vector θ^k2(l)|θ1| are all real values, and some of them can be altered, which is also called mutation in traditional genetic algorithm. Mutation happens rarely, and the chromosome mutation probability is γ in the model. Formally, the mutation indicator vector can be denoted as 𝐦{0,1}d, and the lth entry of vector θk2 after mutation can be represented as

    θk2(l)=𝟙(m(l)=0)θ^k2(l)+𝟙(c(l)=1)rand(0,1),

    where rand(0,1) denotes a random value selected from range [0,1]. Formally, the chromosome vector θk2 defines a new unit model with knowledge inherited form the parent models, which can be denoted as Mk2. Based on the parent model set 𝒫1, all these newly generated models can be represented as 2={Mk2}(Mi1,Mj1)k𝒫1, which will form the 2nd generation of unit models.

3.6.3 Result Ensemble

Based on the models introduced in the previous subsection, seGEN adopts a hierarchical result ensemble method, which involves two steps: (1) local ensemble of results for the sub-graphs on each sampling strategies, and (2) global ensemble of results obtained across different sampling strategies.

  • Local Ensemble: Based on the sub-graph pool 𝒢 obtained via the sampling strategies introduced before, we have learned the Kth generation of the unit model K (or for simplicity), which contains m unit models. Formally, given a sub-graph g𝒢 with node set 𝒱g, by applying unit model Mj to g, the learned representation for node vq𝒱g can be denoted as vector 𝐳j,q, where q denotes the unique node index in the original complete graph G before sampling. For the nodes vp𝒱g, its representation vector will be 𝐳j,p=𝐧𝐮𝐥𝐥, which denotes a dummy vector of length d. Formally, the learned representation feature vector for node vq can be represented as

    𝐳q=g𝒢,Mj,𝐳j,q, (54)

    where operator denotes the concatenation operation of feature vectors. Considering that in the graph sampling step, not all nodes will be selected in sub-graphs. For the nodes vp𝒱g,g𝒢, seGEN introduces a local propagation approach to compute its representations based on its neighbors instead.

    Global Ensemble: Generally, these different graph sampling strategies can capture different local/global structures of the graph, which will all be useful for the node representation learning. In the global result ensemble step, seGEN proposes to group these features together as the output. Formally, for node vq in the original network, its fused representations can be denoted as

    𝐳¯q=i{bfs,dfs,hs,ns,es}wi𝐳qi, (55)

    where 𝐳qbfs, 𝐳qdfs, 𝐳qhs, 𝐳qns and 𝐳qes are the vectors of vq obtained from the above local ensemble based on different graph sampling strategies. In [9], seGEN will simply assign them with equal weights, i.e., 𝐳¯q is an average of 𝐳qbfs, 𝐳qdfs, 𝐳qhs, 𝐳qns and 𝐳qes.

4 Challenges and Opportunities

We have also observed many challenges with graph neural network studies, which provide plenty of opportunities for researchers interested in this topic:

4.1 Over-Smoothing Problem

The existing graph neural networks on giant network representation learning problem suffer from the over-smoothing problem a lot. For instance, if a GCN is deep with many convolutional layers, the output features may be over-smoothed and vertices from different clusters may become indistinguishable, which will render the GCN model fail to work. We have also observed some approaches proposed to resolve such a problem. In [1, 8], both the DifNN and GNL models introduce a set of gates (i.e., the GDU neuron unit) to ensure the nodes can capture the raw inputs, neighbors’ influences and the temporal dynamic states in the learning process, which can resolve the over-smoothing problem effectively.

4.2 Optimization Efficiency

The time complexity of the graph neural network learning, including those for small graphs and giant networks, can be very high. For the small graph oriented graph neural networks, e.g., IsoNN, the major time is spent on enumerating the permutation matrices for the isomorphic feature calculation. Meanwhile, for the giant network oriented graph neural networks, e.g., GCN and GAT, most of the time is spent on the backpropagation along the graph edges for the nodes, which grows almost quadratically as the network size increases and exponentially as the model goes deeper. New optimization algorithms that can address the high learning cost will be necessary and desired.

4.3 The Gap Between Small Graphs and Giant Networks

By this context so far, we haven’t witnessed any graph neural networks that can learn effective representations for both the small graphs and the giant networks simultaneously without any architecture modifications. Proposing a new unified graph neural network model that can work for different types of networks will be desired.

4.4 Graph Neural Network for Dynamic Networks

Most of the network data in the real-world are not static, which keep changing with time. We have observed the GNL model [1] as the first work focusing on the dynamic regression scenario. Such kinds of model can be important, and it may also serve as the tool for analyzing and understanding the brain activities, which is a dynamic network with both structure and states changing all the time.

4.5 Graph Neural Network for Complex Networks

Most of the graph neural networks proposed so far mainly focus on the homogeneous network, which over-simplify the learning setting, since most of the network data in the real world are heterogeneous instead. Generally, in heterogeneous networks, different types of nodes and links can convey different kinds of physical meanings. New models that can effective incorporate such heterogeneous information in the learning process can be desired for concrete real-world applications of graph neural networks.

5 Summary

In this paper, we have introduced the latest graph neural networks proposed for resolving the small graph and giant network oriented research problems. The small graph oriented graph neural network models introduced in this paper include IsoNN, SDBN and LF&ER; whereas, the giant network oriented graph neural network models introduced in this paper include GCN, GAT, DifNN, GNL, GraphSage and seGEN. In addition, we have also introduced several challenges and opportunities on graph neural network studies. This paper will be updated shortly as we observe the new development on this topic in the near future.


References

  • [1] Yixin Chen, Lin Meng, and Jiawei Zhang. Graph neural lasso for dynamic network regression. CoRR, abs/1907.11114, 2019.
  • [2] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. CoRR, abs/1706.02216, 2017.
  • [3] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
  • [4] Lin Meng and Jiawei Zhang. Isonn: Isomorphic neural network for graph representation learning and classification. CoRR, abs/1907.09495, 2019.
  • [5] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018.
  • [6] Pengyang Wang, Jiawei Zhang, Guannan Liu, Yanjie Fu, and Charu Aggarwal. Ensemble-Spotting: Ranking Urban Vibrancy via POI Embedding with Multi-view Spatial Graphs, pages 351–359.
  • [7] Shen Wang, Lifang He, Bokai Cao, Chun-Ta Lu, Philip S. Yu, and Ann B. Ragin. Structural deep brain network mining. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 475–484, New York, NY, USA, 2017. ACM.
  • [8] Jiawei Zhang, Limeng Cui, Yanjie Fu, and Fisher B. Gouza. Fake news detection with deep diffusive network model. CoRR, abs/1805.08751, 2018.
  • [9] Jiawei Zhang, Limeng Cui, and Fisher B. Gouza. SEGEN: sample-ensemble genetic evolutional network model. CoRR, abs/1803.08631, 2018.