Parallel implementations of stochastic gradient descent (SGD) have receivedsignificant research attention, thanks to excellent scalability properties ofthis algorithm, and to its efficiency in the context of training deep neuralnetworks. A fundamental barrier for parallelizing large-scale SGD is the factthat the cost of communicating the gradient updates between nodes can be verylarge. Consequently, lossy compression heuristics have been proposed, by whichnodes only communicate quantized gradients. Although effective in practice,these heuristics do not always provably converge, and it is not clear whetherthey are optimal. In this paper, we propose Quantized SGD (QSGD), a family of compressionschemes which allow the compression of gradient updates at each node, whileguaranteeing convergence under standard assumptions. QSGD allows the user totrade off compression and convergence time: it can communicate a sublinearnumber of bits per iteration in the model dimension, and can achieveasymptotically optimal communication cost. We complement our theoreticalresults with empirical data, showing that QSGD can significantly reducecommunication cost, while being competitive with standard uncompressedtechniques on a variety of real tasks. In particular, experiments show that gradient quantization applied totraining of deep neural networks for image classification and automated speechrecognition can lead to significant reductions in communication cost, andend-to-end training time. For instance, on 16 GPUs, we are able to train aResNet-152 network on ImageNet 1.8x faster to full accuracy. Of note, we showthat there exist generic parameter settings under which all known networkarchitectures preserve or slightly improve their full accuracy when usingquantization.