A High-Throughput Solver for Marginalized Graph Kernels on GPU

Abstract

We present the design of a solver for the efficient and high-throughputcomputation of the marginalized graph kernel on General Purpose GPUs. The graphkernel is computed using conjugate gradient to solve a generalized Laplacian ofthe tensor product between a pair of graphs. To cope with the large gap betweenthe instruction throughput and the memory bandwidth of the GPUs, our solverforms the graph tensor product on-the-fly without storing it in memory. This isachieved by using threads in a warp cooperatively to stream the adjacency andedge label matrices of individual graphs by small square tiles, which are thenstaged in registers and the shared memory for later reuse. Warps across athread block can further share tiles via the shared memory to increase datareuse. The sparsity of the graphs is exploited hierarchically by storing onlynon-empty tiles of the graphs and non-zero elements within each tile using acoordinate format and bitmaps, respectively. A new partition-based reorderingalgorithm is proposed for aggregating non-zero elements of the graphs intofewer but denser tiles in order to improve the efficiency of the sparse format.We carried out extensive theoretical analyses on the graph tensor productprimitives between tiles of various density, and evaluated their performance onsynthetic and real-world datasets. Our implementation is able to deliver threeto four orders of magnitude speedup over existing CPU-based solvers. Thecapability of the solver can enable kernel-based learning tasks atunprecedented scales.

Quick Read (beta)

loading the full paper ...