Distributed training of deep nets is an important technique for addressing some of the present-day computing challenges, such as memory consumption and computational demands. Classical distributed approaches, synchronous or asynchronous, are based on the parameter server architecture, i.e., worker nodes compute gradients that are communicated to the parameter server, and updated parameters are returned. Recently, distributed training with AllReduce operations has gained popularity as well. While many of those operations seem appealing, little is reported about wall-clock training time improvements. In this paper, we carefully analyze the AllReduce-based setup, propose timing models that include network latency, bandwidth, cluster size, and compute time, and demonstrate that pipelined training with a width of two combines the best of both synchronous and asynchronous training. Specifically, for a setup consisting of a four-node GPU cluster, we show wall-clock training time improvements of up to 5.4x compared to conventional approaches.
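
To make the ingredients of such a timing model concrete, the following is a minimal sketch that assumes the standard latency-bandwidth (alpha-beta) cost of a ring AllReduce; it is not necessarily the model proposed in this paper. All function and parameter names are hypothetical and chosen for illustration. The sketch also assumes that sequential training pays for compute and communication back to back, whereas an ideally overlapped scheme hides the shorter of the two behind the longer.

```python
def ring_allreduce_time(num_bytes, num_nodes, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta estimate for one ring AllReduce (assumed model).

    A ring AllReduce runs a reduce-scatter followed by an all-gather,
    each taking (num_nodes - 1) steps that move one chunk of size
    num_bytes / num_nodes between neighbouring nodes.
    """
    if num_nodes < 2:
        return 0.0  # a single node has nothing to communicate
    steps = 2 * (num_nodes - 1)
    chunk_bytes = num_bytes / num_nodes
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def sequential_iteration_time(compute_s, comm_s):
    """Compute and communication happen one after the other."""
    return compute_s + comm_s


def overlapped_iteration_time(compute_s, comm_s):
    """Idealized overlap: the slower of the two phases sets the pace."""
    return max(compute_s, comm_s)


if __name__ == "__main__":
    # Hypothetical numbers: 100 MB of gradients, four nodes,
    # 50 microseconds latency, 10 Gb/s links, 100 ms of compute.
    comm = ring_allreduce_time(1e8, 4, 50e-6, 1.25e9)
    compute = 0.1
    print("communication [s]:", comm)
    print("sequential    [s]:", sequential_iteration_time(compute, comm))
    print("overlapped    [s]:", overlapped_iteration_time(compute, comm))
```

In this simplified view, the per-iteration gain from overlapping is bounded by a factor of two and is largest when compute and communication times are balanced; any further gain would have to come from effects the sketch does not model.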