DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Abstract

Graph neural networks (GNN) have shown great success in learning fromgraph-structured data. They are widely used in various applications, such asrecommendation, fraud detection, and search. In these domains, the graphs aretypically large, containing hundreds of millions of nodes and several billionsof edges. To tackle this challenge, we develop DistDGL, a system for trainingGNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on theDeep Graph Library (DGL), a popular GNN development framework. DistDGLdistributes the graph and its associated data (initial features and embeddings)across the machines and uses this distribution to derive a computationaldecomposition by following an owner-compute rule. DistDGL follows a synchronoustraining approach and allows ego-networks forming the mini-batches to includenon-local nodes. To minimize the overheads associated with distributedcomputations, DistDGL uses a high-quality and light-weight min-cut graphpartitioning algorithm along with multiple balancing constraints. This allowsit to reduce communication overheads and statically balance the computations.It further reduces the communication by replicating halo nodes and by usingsparse embedding updates. The combination of these design choices allowsDistDGL to train high-quality models while achieving high parallel efficiencyand memory scalability. We demonstrate our optimizations on both inductive andtransductive GNN models. Our results show that DistDGL achieves linear speedupwithout compromising model accuracy and requires only 13 seconds to complete atraining epoch for a graph with 100 million nodes and 3 billion edges on acluster with 16 machines.

Quick Read (beta)

loading the full paper ...