### Abstract

Finding the biomarkers associated with ASD is helpful for understanding theunderlying roots of the disorder and can lead to earlier diagnosis and moretargeted treatment. A promising approach to identify biomarkers is using GraphNeural Networks (GNNs), which can be used to analyze graph structured data,i.e. brain networks constructed by fMRI. One way to interpret importantfeatures is through looking at how the classification probability changes ifthe features are occluded or replaced. The major limitation of this approach isthat replacing values may change the distribution of the data and lead toserious errors. Therefore, we develop a 2-stage pipeline to eliminate the needto replace features for reliable biomarker interpretation. Specifically, wepropose an inductive GNN to embed the graphs containing different properties oftask-fMRI for identifying ASD and then discover the brain regions/sub-graphsused as evidence for the GNN classifier. We first show GNN can achieve highaccuracy in identifying ASD. Next, we calculate the feature importance scoresusing GNN and compare the interpretation ability with Random Forest. Finally,we run with different atlases and parameters, proving the robustness of theproposed method. The detected biomarkers reveal their association with socialbehaviors. We also show the potential of discovering new informativebiomarkers. Our pipeline can be generalized to other graph feature importanceinterpretation problems.1

### Quick Read (beta)

# Graph Neural Network for Interpreting Task-fMRI Biomarkers

###### Abstract

Finding the biomarkers associated with ASD is helpful for understanding the underlying roots of the disorder and can lead to earlier diagnosis and more targeted treatment. A promising approach to identify biomarkers is using Graph Neural Networks (GNNs), which can be used to analyze graph structured data, i.e. brain networks constructed by fMRI. One way to interpret important features is through looking at how the classification probability changes if the features are occluded or replaced. The major limitation of this approach is that replacing values may change the distribution of the data and lead to serious errors. Therefore, we develop a 2-stage pipeline to eliminate the need to replace features for reliable biomarker interpretation. Specifically, we propose an inductive GNN to embed the graphs containing different properties of task-fMRI for identifying ASD and then discover the brain regions/sub-graphs used as evidence for the GNN classifier. We first show GNN can achieve high accuracy in identifying ASD. Next, we calculate the feature importance scores using GNN and compare the interpretation ability with Random Forest. Finally, we run with different atlases and parameters, proving the robustness of the proposed method. The detected biomarkers reveal their association with social behaviors. We show the potential of discovering new informative biomarkers. Our pipeline can be generalized to other graph feature importance interpretation problems.

^{1}

^{1}footnotetext: This work was supported by NIH Grant R01NS035193.

## 1 Introduction

Autism spectrum disorders (ASD) affect the structure and function of the brain. To better target the underlying roots of ASD for diagnosis and treatment, efforts to identify reliable biomarkers are growing [8]. Significant progress has been made using functional magnetic resonance imaging (fMRI) to characterize the brain remodeling in ASD [9]. Recently, emerging research on Graph Neural Networks (GNNs) has combined deep learning with graph representation and applied an integrated approach to fMRI analysis in different neuro-disorders [11]. Most existing approaches (based on Graph Convolutional Network (GCN) [10]) require all nodes in the graph to be present during training and thus lack natural generalization on unseen nodes. Also, it is necessary to interpret the important feature in the data used as evidence for the model, but currently no tool exists that can interpret and explain GNNs while recent CNN explanation algorithms cannot directly work on graph input.

Our main contributions include the following three points: 1) We develop a method to integrate all the available connectivity, geometric, anatomic information and task-fMRI (tfMRI) related parameters into graphs for deep learning. Our approach alleviates the problem of predetermining the best features and measures of functional connectivity, which is often ambiguous due to the intrinsic complex structure of task-fMRI. 2) We propose a generalizable GNN inductive learning model to more accurately classify ASD v.s. healthy controls (HC). Different from the spectral GCN [10], our GNN classifier is based on graph isomorphism, which can be applied to multigraphs with different nodes/edges (e.g. sub-graphs), and learn local graph information without binding to the whole graph structure. 3) The GNN architecture enables us to train the model on the whole graph and validate it on subgraphs. We directly evaluate the importance scores on sub-graphs and nodes (i.e. regions of interest (ROIs)) by examining model responses, without resampling value for the occluded features. The 2-stage pipeline to interpret important sub-graphs/ROIs, which are defined as biomarkers in our setting, is shown in Fig. 1.

## 2 Methodology

### 2.1 Graph Definition

We firstly parcellate the brain into $N$ ROIs based on its T1 structural MRI. We define ROIs as graph nodes. We define an undirected multigraph $\bm{G}=(\bm{V},\bm{E})$, where $\bm{V}={({\overrightarrow{v}}_{1},{\overrightarrow{v}}_{2},\mathrm{\dots},{\overrightarrow{v}}_{N})}^{T}\in {\mathbb{R}}^{N\times D}$ and $\bm{E}=[{\overrightarrow{e}}_{ij}]\in {\mathbb{R}}^{N\times N\times F}$, ${D}$ and ${F}$ are the attribute dimensions of nodes and edges respectively. For node attributes, we concatenate handcrafted features: degree of connectivity, General Linear Model (GLM) coefficients, mean, standard deviation of task-fMRI, and ROI center coordinates. We applied the Box-Cox transformation [13] to make each feature follow a normal distribution (parameters are learned from the training set and applied to the training and testing sets). The edge attribute ${\bm{e}}_{ij}$ of node $i$ and $j$ includes the Pearson correlation, partial correlation calculated using residual fMRI, and $exp(-{r}_{ij}/10)$ where ${r}_{ij}$ is the geometric distance between the centers of the two ROIs. We removed the edges under the 95th percentile of partial correlation values to ensure sparsity for efficient computation and avoiding oversmoothing.

### 2.2 Graph Neural Network (GNN) Classifier

The architecture of our proposed GNN is shown in Fig. 2 (node, edge attribute definition, kernel sizes are denoted). The model inductively learns node representation by recursively aggregating and transforming feature vectors of its neighboring nodes. Below, we define the layers in the proposed GNN classifier.

#### 2.2.1 Convolutional Layer

Following Message Passing Neural Networks (NNconv) [7], which is invariant to graph symmetries, we leverage node degree in the embedding. The embedded representation of the $l$th convolutional layer ${\overrightarrow{v}}_{i}^{(l)}\in {\mathbb{R}}^{{d}^{(l)}}$ is

$${\overrightarrow{v}}_{i}^{(l)}=\frac{1}{|\mathcal{N}(i)|+1}\sigma (\mathbf{\Theta}{\overrightarrow{v}}_{i}^{(l-1)}+\sum _{j\in \mathcal{N}(i)}{h}_{\varphi}({\overrightarrow{e}}_{ij}){\overrightarrow{v}}_{j}^{(l-1)}),$$ | (1) |

where $\sigma (\cdot )$ is a nonlinear activation function (we use relu here), $\mathcal{N}(i)$ is node $i$’s 1-hop neighborhood, $\mathbf{\Theta}\in {\mathbb{R}}^{{d}^{(l)}\times {d}^{(l-1)}}$ is a learnable propagation matrix, ${h}_{\varphi}$ denotes a Multi-layer Perceptron (MLP), which maps the edge attributes ${\bm{e}}_{ij}$ to a ${d}^{(l)}\times {d}^{(l-1)}$ matrix, and we initialize ${\overrightarrow{v}}_{i}^{(0)}={\overrightarrow{v}}_{i}$.

#### 2.2.2 Pooling Aggregation Layer

To make sure that down-sampling layers behave idiomatically with respect to different graph sizes and structures, we adopt the approach in [2] for reducing graph nodes. The choice of which nodes to drop is done based on projecting the node attributes on a learnable vector ${\overrightarrow{w}}^{(l-1)}\in {\mathbb{R}}^{{d}^{(l-1)}}$. The nodes receiving lower scores will experience less feature retention. Fully written out, the operation of this pooling layer computing a pooled graph, $({\bm{V}}^{(l)},{\bm{E}}^{(l)})$, from an input graph, $({\bm{V}}^{(l-1)},{\bm{E}}^{(l-1)})$, is expressed as follows:

$$\overrightarrow{y}=\frac{{\bm{V}}^{(l-1)}{\overrightarrow{w}}^{(l-1)}}{\parallel {\overrightarrow{w}}^{(l-1)}\parallel}\hspace{1em}\overrightarrow{i}=\text{top}k(\overrightarrow{y},k)\hspace{1em}{\bm{V}}^{(l)}={({\bm{V}}^{(l-1)}\odot \mathrm{tanh}(\overrightarrow{y}))}_{\overrightarrow{i},:}\hspace{1em}{\bm{E}}^{(l)}={\bm{E}}_{\overrightarrow{i},\overrightarrow{i}}^{(l-1)}.$$ | (2) |

Here $\parallel \cdot \parallel $ is the ${L}_{2}$ norm, top$k$ finds the indices corresponding to the largest $k$ elements in vector $\overrightarrow{y}$, $\odot $ is (broadcasted) element-wise multiplication, and ${(\cdot )}_{\overrightarrow{i},\overrightarrow{j}}$ is an indexing operation which takes elements at row indices specified by $\overrightarrow{i}$ and column indices specified by $\overrightarrow{j}$ (colon denotes all indices). The pooling operation trivially retains sparsity by requiring only a projection, a point-wise multiplication and a slicing into the original feature and adjacency matrix. Different from [2], we induce constraint ${\parallel {\overrightarrow{w}}^{(l)}\parallel}_{2}=1$ implemented by adding an additional regularization loss $\lambda {\sum}_{l=1}^{L}{({\parallel {\overrightarrow{w}}^{(l)}\parallel}_{2}-1)}^{2}$ to avoid identifiability issues.

#### 2.2.3 Readout Layer

Lastly, we seek a “flattening” operation to preserve information about the input graph in a fixed-size representation. Concretely, to summarise the output graph of the $l$th conv-pool block, $({\bm{V}}^{(l)},{\bm{E}}^{(l)})$, we use

$${\overrightarrow{s}}^{(l)}=(\frac{1}{{N}^{(l)}}\sum _{i=1}^{{N}^{(l)}}{\overrightarrow{v}}_{i}^{(l)})\parallel \mathrm{max}(\{{\overrightarrow{v}}_{i}^{(l)}:i=1,\mathrm{\dots},{N}^{(l)}\}),$$ | (3) |

where ${N}^{(l)}$ is the number of graph nodes, ${\overrightarrow{v}}_{i}^{(l)}$ is the $i$th node’s feature vector, $\mathrm{max}$ operates elementwisely, and $\parallel $ denotes concatenation. The final summary vector is obtained as the concatenation of all those summaries (i.e. $\overrightarrow{s}={\overrightarrow{s}}^{(1)}\parallel {\overrightarrow{s}}^{(2)}\parallel \mathrm{\dots}\parallel {\overrightarrow{s}}^{(L)}$) and submitted to a MLP for obtaining final predictions.

### 2.3 Explain Input Data Sensitivity

To explain input data sensitivity, we cluster the whole brain graph into sub-graphs first. Then we investigate the predictive power of each sub-graph, further assign importance score to each ROI.

#### 2.3.1 Network Community Clustering

From now on we add the subscript to the graph as ${\bm{G}}_{s}=({\bm{V}}_{s},{\bm{E}}_{s})$ for the $s$th instance, $s=1,\mathrm{\dots},S$, where $S$ is the number of graphs. Concatenating the sparsified non-negative partial correlation matrices ${({\bm{E}}_{s})}_{:,:,2}$ for all the graphs, we can create a 3rd-order tensor $\tau $ of dimension $N\times N\times S$. Non-negative PARAFAC [3] tensor decomposition is applied to tensor $\tau $ to discover overlapping functional brain networks. Given decomposition rank $R$, $\tau \approx {\sum}_{j=1}^{R}{\overrightarrow{a}}_{j}\otimes {\overrightarrow{b}}_{j}\otimes {\overrightarrow{c}}_{j}$, where loading vectors ${\overrightarrow{a}}_{j}\in {\mathbb{R}}^{N}$, ${\overrightarrow{b}}_{j}\in {\mathbb{R}}^{N}$, ${\overrightarrow{c}}_{j}\in {\mathbb{R}}^{S}$ and $\otimes $ denotes the vector outer product. ${\overrightarrow{a}}_{j}={\overrightarrow{b}}_{j}$ since the connectivity matrix is symmetric. The $i$th element of ${\overrightarrow{a}}_{j}$, ${a}_{ji}$ provides the membership of region $i$ in the community $j$. Here, we consider region $i$ belongs to community $j$ if ${a}_{ji}>mean({\overrightarrow{a}}_{j})+std({\overrightarrow{a}}_{j})$ [12]. This gives us a collection of community indices indicating region membership $\{{\overrightarrow{i}}_{j}\subset \{1,\mathrm{\dots},N\}:j=1,\mathrm{\dots},R\}$.

#### 2.3.2 Graph Salience Mapping

After decomposing all the brain networks into community sub-graphs $\{{\bm{G}}_{sj}=({({\bm{V}}_{s})}_{{\overrightarrow{i}}_{j},:},{({\bm{E}}_{s})}_{{\overrightarrow{i}}_{j},{\overrightarrow{i}}_{j}}):s=1,\mathrm{\dots},S,j=1,\mathrm{\dots},R\}$, we use a salience mapping method to assign each sub-graph an importance score. In our classification setting, the probability of class $c\in \{0,1\}$ (0: HC, 1: ASD) given the original network $\bm{G}$ is estimated from the predictive score of the model: $p(c|\bm{G})$ . To calculate $p(c|{\bm{G}}_{sj})$, different from CNN or GCN, we can directly input the sub-graph into the pre-trained classifier. We denote ${c}_{s}$ as the class label for instance $s$ and define Evidence for Correct Class (ECC) for each community:

$${\text{\mathit{E}\mathit{C}\mathit{C}}}_{j}=\frac{1}{S}\sum _{s}\mathrm{tanh}({\mathrm{log}}_{2}(p(c={c}_{s}|{\bm{G}}_{sj})/(1-p(c={c}_{s}|{\bm{G}}_{sj})))),$$ | (4) |

where laplace correction ($p\leftarrow (pS+1)/(S+2)$) is used to avoid zero denominators. Note that log odds-ratio is commonly used in logistic regression to make ${p}$ more separable. The nonlinear tanh function is used for bounding ECC. ECC can be positive or negative. A positive value provides evidence for the classifier, whereas a negative value provides evidence against the classifier. The final importance score for node $k$ is calculated by ${\sum}_{j:k\in {\overrightarrow{i}}_{j}}{\text{\mathit{E}\mathit{C}\mathit{C}}}_{j}/|{\overrightarrow{i}}_{j}|$. The larger the score, the more possible the node can be used as a distinguishable marker.

## 3 Experiments and Results

### 3.1 Data Acquisition and Preprocessing

We tested our method on a group of 75 ASD children and 43 age and IQ-matched healthy controls collected at Yale Child Study Center. Each subject underwent a task fMRI scan (BOLD, TR = 2000 ms, TE = 25 ms, flip angle = ${60}^{\circ}$, voxel size $3.44\times 3.44\times 4m{m}^{3}$) acquired on a Siemens MAGNETOM Trio TIM 3T scanner. For the fMRI scans, subjects performed the ”biopoint” task, viewing point light animations of coherent and scrambled biological motion in a block design [9] ($24s$ per block). The fMRI data was preprocessed following the pipeline in [14].

The mean time series for each node were extracted from a random ${1}{/}{3}$ of voxels in the ROI (given an atlas) of preprocessed images by bootstrapping. We augmented the ASD data 10 times and the HC data 20 times, resulting in 750 ASD graphs and 860 HC graphs separately. We split the data into 5 folds based on subjects. Four folds were used as training data and the left out fold was used for testing. Based on the definition in Section 2.1, each node attribute ${\overrightarrow{v}}_{i}\in {\mathbb{R}}^{10}$ and each edge attribute ${\overrightarrow{e}}_{ij}\in {\mathbb{R}}^{3}$. Specifically, the GLM parameters of ”biopoint task” are: ${\beta}_{1}:$ coefficient of biological motion matrix; ${\beta}_{3}$: coefficient of scramble motion matrix; ${\beta}_{2}$ and ${\beta}_{4}$: coefficients of the previous two matrices’ derivatives.

### 3.2 Step 1: Train ASD/HC Classification Model

Firstly, we tested classifier performance on the Destrieux atlas [5] (148 ROIs) using proposed GNN. Since our pipeline integrated interpretation and classification, we apply a random forest (RF) using 1000 trees as an additional ”reality check”, while the other existing graph classification models either cannot achieve the performance as GNN [2, 7] or do not have straightforward and reliable interpretation ability [1]. We flattened the features to $\bm{V}\in {\mathbb{R}}^{1480}$ and $\bm{E}\in {\mathbb{R}}^{65712}$ ($65712=148\times 148\times 3$) and used this input to the RF. In our GNN, ${{d}}^{{(}{0}{)}}{=}{D}{=}{10}$, ${{d}}^{{(}{1}{)}}{=}{16}{,}{{d}}^{{(}{2}{)}}{=}{8}$, resulting in 2746 trainable parameters and we tried different pooling ratios $r$ ($k=r\times N)$ in Fig. 2, which was implemented based on [6]. We applied softmax after the network output and combined cross entropy loss and regularization loss with $\lambda =0.001$ as the objective function. We used the Adam optimizer with initial learning 0.001, then decreased it by a factor of $10$ every 50 epochs. We trained the network 300 epochs for all of the splits and measured the instance classification by accuracy, F-score, precision and recall (see Table 1). Our proposed model significantly outperformed the alternative method, due to its ability to embed high dimensional features based on the structural relationship. We selected the best GNN model with $r=0.5$ in the next step: interpreting biomarkers.

Model | RF(V) | RF(E) | RF(V+E) | GNN(r=0.3) | GNN(r=0.5) | GNN(r=0.8) |
---|---|---|---|---|---|---|

Accuracy | $0.71\pm 0.05$ | $0.66\pm 0.06$ | $0.68\pm 0.06$ | $0.67\pm 0.14$ | $\mathbf{0.76}\mathbf{\pm}\mathbf{0.06}$ | $0.73\pm 0.07$ |

F-score | $0.69\pm 0.06$ | $0.68\pm 0.06$ | $0.63\pm 0.12$ | $0.68\pm 0.09$ | $\mathbf{0.79}\mathbf{\pm}\mathbf{0.08}$ | $0.71\pm 0.10$ |

Precision | $0.68\pm 0.06$ | $0.61\pm 0.06$ | $0.69\pm 0.12$ | $0.65\pm 0.19$ | $\mathbf{0.76}\mathbf{\pm}\mathbf{0.12}$ | $0.68\pm 0.08$ |

Recall | $0.73\pm 0.12$ | $0.76\pm 0.10$ | $0.77\pm 0.09$ | $0.74\pm 0.07$ | $\mathbf{0.82}\mathbf{\pm}\mathbf{0.06}$ | $0.75\pm 0.08$ |

### 3.3 Step 2: Interpret and Explain Biomarkers

We put forth the hypothesis that the more accurate the classifier, the more reliable biomarkers can be found. We used the best RF model using $\bm{V}$ as inputs ($77.4\%$ accuracy on testing set) and used the RF-based feature importance (mean Gini impurity decrease) as a form of standard method for comparison. For GNN interpretation, we also chose the best model ($83.6\%$ accuracy on testing set). Further, to be comparable with RF, all of the interpretation experiments were performed on the training set only. The interpretation results are shown in Fig. 3, where the top 30 important ROIs (averaged over node features and instances) selected by RF are shown in yellow and the top 30 important ROIs selected by our proposed GNN in red. Nine important ROIs were selected by both methods. In addition, for node attribute importance, we averaged the importance score over ROIs and instances for RF. For GNN, we averaged gradient explanation over all the nodes and instances, i.e. $\mathbb{E}(\frac{1}{N}{\sum}_{i}|\frac{\partial y}{\partial {v}_{ij}}|)$, where $y=p(c=1|\bm{G})$, which quantifies the sensitivity of the ${j}$th node attribute. From Fig. 3(c) we show the relative importance to the most important node attribute, our proposed method assigned more uniform importance to each node attribute, among which the biological motion parameter ${\beta}_{1}$ was the most important. In addition, similar features, mean/std of task-fMRI (tf_mean/tf_std) and coordinates $(x,y,z)$, have similar scores, which makes more sense for human interpretation. Notice that our proposed pipeline is also able to identify sub-graph importance from Eq. (4), which is helpful for understanding the interaction between different brain regions. We selected the top 2 sub-graphs ($R$=20) and used Neurosynth [15] to decode the functional keywords associated with the sub-graphs (shown in Fig. 4). These networks are both associated with high-level social behaviors. To illustrate the predictive power of the 2 sub-graphs, we retrained the network using the graph slicing on those 19 ROIs of the 2 sub-graphs as input. Accuracy on the testing set (in the split of the best model) was $78.9\%$, achieving comparable performance to using the whole graph.

### 3.4 Evaluation: Robustness Discussion

To examine the potential influence of different graph building strategies on the reliability of network estimates, the functional and anatomical data were registered and parcellated by the Destrieux atlas (A1) and the Desikan-Killiany atlas (A2) [4]. We also showed the robustness of the results with respect to the number of clusters for $R=10,20,30$. The results are shown in Fig. 5. We ranked ECCs for each node and indicated the top 30 ROIs in A1 and top 15 ROIs in A2. The altas and number of clusters are indicated on the left of each sub-figure. Orbitofrontal cortex and ventromedial prefrontal cortex are selected in all the cases, which are social motivation related and have previously been shown to be associated with ASD [9]. We also validated the results by decoding the neurological functions of the important ROIs overlapped with Neurosynth.

## 4 Conclusion and Future Work

We proposed a framework to discover ASD brain biomarkers from task-fMRI using GNN. It achieved improved accuracy and more interpretable features than the baseline method. We also showed our method performed robustly on different atlases and hyper-parameters. The pipeline can be generalized to other feature importance analysis problems, such as resting-fMRI biomarker discovery and vessel cancer detection.

## References

- [1] Adebayo, J., et al.: Sanity checks for saliency maps. In: Advances in Neural Information Processing Systems. pp. 9505–9515 (2018)
- [2] Cangea, C., et al.: Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287 (2018)
- [3] Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika 35(3), 283–319 (1970)
- [4] Desikan, R.S., et al.: An automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest. Neuroimage 31(3), 968–980 (2006)
- [5] Destrieux, C., et al.: Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. Neuroimage 53(1), 1–15 (2010)
- [6] Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. CoRR abs/1903.02428 (2019)
- [7] Gilmer, J., et al.: Neural message passing for quantum chemistry. In: ICML 2017. pp. 1263–1272. JMLR. org (2017)
- [8] Goldani, A.A., et al.: Biomarkers in autism. Frontiers in psychiatry 5 (2014)
- [9] Kaiser, M.D., et al.: Neural signatures of autism. PNAS (2010)
- [10] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
- [11] Ktena, S.I., et al.: Distance metric learning using graph convolutional networks: Application to functional brain networks. In: MICCAI (2017)
- [12] Loe, C.W., Jensen, H.J.: Comparison of communities detection algorithms for multiplex. Physica A: Statistical Mechanics and its Applications 431, 29–45 (2015)
- [13] Nishii, R.: Box-Cox transformation. Encyclopedia of Mathematics (2001)
- [14] Yang, D., et al.: Brain responses to biological motion predict treatment outcome in young children with autism. Translational psychiatry 6(11), e948 (2016)
- [15] Yarkoni, T., et al.: Large-scale automated synthesis of human functional neuroimaging data. Nature methods 8(8), 665 (2011)