In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.
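
To make the norm-preservation condition concrete, the following is a minimal NumPy sketch and not necessarily the algorithm presented in this work. It assumes the simplest construction one could use: a kernel that is zero everywhere except at its spatial center, where it holds a random orthogonal matrix. It also assumes equal input and output channel counts and periodic boundary conditions. The function names `delta_orthogonal_kernel` and `circular_conv2d` are hypothetical and chosen only for illustration.

```python
import numpy as np


def delta_orthogonal_kernel(k, channels, rng):
    """Illustrative kernel of shape (k, k, channels, channels).

    All spatial taps are zero except the center tap, which holds a random
    orthogonal matrix. This is an assumed, simplified construction.
    """
    # The QR decomposition of a Gaussian matrix yields an orthogonal factor q.
    a = rng.standard_normal((channels, channels))
    q, r = np.linalg.qr(a)
    # Fix the sign ambiguity of QR so that q is uniformly (Haar) distributed.
    q = q * np.sign(np.diag(r))
    kernel = np.zeros((k, k, channels, channels))
    kernel[k // 2, k // 2] = q
    return kernel


def circular_conv2d(x, kernel):
    """Cross-correlate x (H, W, C_in) with kernel (k, k, C_in, C_out),
    using periodic (wrap-around) boundary conditions."""
    k = kernel.shape[0]
    out = np.zeros(x.shape[:2] + (kernel.shape[3],))
    for i in range(k):
        for j in range(k):
            # Shift the input so that tap (i, j) lines up with each output pixel.
            shifted = np.roll(x, shift=(k // 2 - i, k // 2 - j), axis=(0, 1))
            out += shifted @ kernel[i, j]
    return out


rng = np.random.default_rng(0)
kernel = delta_orthogonal_kernel(3, 8, rng)
x = rng.standard_normal((5, 5, 8))
y = circular_conv2d(x, kernel)
# The two norms agree up to floating-point error.
print(np.linalg.norm(x), np.linalg.norm(y))
```

Because the only nonzero tap applies an orthogonal matrix to the channel vector at each pixel, the output has the same Euclidean norm as the input, which is the sense of "orthogonal" used above.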