Data parallelism can boost the training speed of convolutional neural networks (CNNs), but can suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. However, these techniques can perform poorly when used together with decentralized aggregation protocols such as ring all-reduce (RAR), mainly because they cannot directly aggregate compressed gradients. In this paper, we empirically demonstrate strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, that exploits these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiVeQ enables direct aggregation of compressed gradients, and hence allows us to build a distributed learning system that parallelizes GradiVeQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiVeQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5X without noticeable accuracy loss, and reduces the end-to-end training time by almost 50%. The results also show that GradiVeQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.
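The property that makes direct aggregation possible is linearity: a PCA projection is a matrix multiplication, so the sum of compressed gradients equals the compression of the summed gradient. The following NumPy sketch illustrates this on synthetic data and is not the GradiVeQ implementation. The worker count, dimensions, low-rank gradient model, uncentered PCA via SVD, and the helper names `sample_gradients`, `compress`, and `decompress` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 4 workers, 64-dimensional gradient slices that lie close
# to an 8-dimensional linear subspace (a synthetic stand-in for the linear
# correlations between CNN gradients described in the abstract).
n_workers, dim, rank = 4, 64, 8
basis = rng.standard_normal((rank, dim))


def sample_gradients(n):
    """Draw n synthetic gradient slices: low-rank signal plus small noise."""
    coeffs = rng.standard_normal((n, rank))
    noise = 1e-3 * rng.standard_normal((n, dim))
    return coeffs @ basis + noise


# Fit an uncentered PCA (assumption: no mean subtraction, which keeps the
# compressor strictly linear) on previously observed gradient samples.
train = sample_gradients(256)
_, _, vt = np.linalg.svd(train, full_matrices=False)
components = vt[:rank]  # shape (rank, dim), orthonormal rows


def compress(g):
    """Project a gradient slice onto the top principal components."""
    return components @ g


def decompress(z):
    """Map a compressed vector back to the original gradient space."""
    return components.T @ z


grads = sample_gradients(n_workers)

# Path A: each worker compresses first; the aggregation step (for example a
# ring all-reduce) only ever sums the short compressed vectors.
compressed_sum = sum(compress(g) for g in grads)
recovered = decompress(compressed_sum)

# Path B: aggregate the uncompressed gradients, for reference.
exact = grads.sum(axis=0)

rel_err = np.linalg.norm(recovered - exact) / np.linalg.norm(exact)
print(f"values communicated per slice: {rank} instead of {dim}")
print(f"relative error of compressed-domain aggregation: {rel_err:.2e}")
```

In this toy setting each worker communicates 8 values per slice instead of 64, and the aggregate recovered from the compressed domain matches the exact sum up to the noise outside the principal subspace. A nonlinear scalar quantizer does not commute with summation in this way, so partial sums would generally have to be decompressed and recompressed at every aggregation hop.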