We analyze the dynamics of training deep ReLU networks and their implications for generalization capability. Using a teacher-student setting, we discover a novel relationship between the gradient received by hidden student nodes and the activations of teacher nodes for deep ReLU networks. With this relationship and the assumption of small overlap among teacher node activations, we prove that (1) student nodes whose weights are initialized close to teacher nodes converge to them at a faster rate, and (2) in the over-parameterized regime and the 2-layer case, while a small set of lucky nodes does converge to the teacher nodes, the fan-out weights of the other nodes converge to zero. This framework provides insight into multiple puzzling phenomena in deep learning, such as over-parameterization, implicit regularization, and lottery tickets. We verify our assumption by showing that the majority of BatchNorm biases of pre-trained VGG11/16 models are negative. Experiments on (1) random deep teacher networks with Gaussian inputs, (2) a teacher network pre-trained on CIFAR-10, and (3) extensive ablation studies validate our multiple theoretical predictions.
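To make the teacher-student setting concrete, the following is a minimal NumPy sketch of the 2-layer, over-parameterized case with Gaussian inputs. It is an illustration under stated assumptions, not the experimental code behind the results above: the layer sizes, learning rate, step count, initialization scale, and the helper names `make_two_layer`, `train_student`, `alignment`, and `fan_out_norms` are all hypothetical choices made here.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def make_two_layer(rng, d_in, n_hidden, d_out):
    # Random 2-layer ReLU network without biases (an assumption of this sketch).
    W1 = rng.standard_normal((d_in, n_hidden)) / np.sqrt(d_in)
    W2 = rng.standard_normal((n_hidden, d_out)) / np.sqrt(n_hidden)
    return W1, W2


def forward(params, x):
    W1, W2 = params
    h = relu(x @ W1)          # hidden activations, one column per node
    return h, h @ W2


def train_student(seed=0, d_in=10, n_teacher=5, n_student=20, d_out=3,
                  n_samples=2000, lr=0.05, steps=200):
    # The student has more hidden nodes than the teacher (over-parameterized).
    rng = np.random.default_rng(seed)
    teacher = make_two_layer(rng, d_in, n_teacher, d_out)
    W1, W2 = make_two_layer(rng, d_in, n_student, d_out)

    x = rng.standard_normal((n_samples, d_in))   # Gaussian inputs
    _, y = forward(teacher, x)                   # teacher outputs are the targets

    losses = []
    for _ in range(steps):
        h, y_hat = forward((W1, W2), x)
        err = y_hat - y
        losses.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))

        # Full-batch gradients of the mean squared error.
        grad_W2 = h.T @ err / n_samples
        grad_pre = (err @ W2.T) * (h > 0)        # back-propagate through ReLU
        grad_W1 = x.T @ grad_pre / n_samples

        W1 = W1 - lr * grad_W1
        W2 = W2 - lr * grad_W2
    return teacher, (W1, W2), losses


def alignment(W1_student, W1_teacher):
    # Cosine similarity between every student node and every teacher node.
    s = W1_student / np.linalg.norm(W1_student, axis=0, keepdims=True)
    t = W1_teacher / np.linalg.norm(W1_teacher, axis=0, keepdims=True)
    return s.T @ t                               # shape (n_student, n_teacher)


def fan_out_norms(W2_student):
    # Norm of the outgoing weights of each student hidden node.
    return np.linalg.norm(W2_student, axis=1)
```

After calling `teacher, student, losses = train_student()`, inspecting `alignment(student[0], teacher[0])` alongside `fan_out_norms(student[1])` is one way to look for the two behaviors described above: student nodes that align with a teacher node, and student nodes whose fan-out weights shrink. A run this short is only a starting point and is not expected to reproduce the reported results.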