Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution, however, does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of "high capacity" features is not consistent across different training runs, which agrees with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.
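
To make the margin-based measure concrete, here is a minimal sketch of one way to compute per-sample margins and the area under the sorted margin curve. It rests on assumptions that the abstract does not state: the margin is taken as the true-class score minus the largest competing score, the margins are left unnormalized (any rescaling by weight norms is omitted), and the x-axis is the fraction of training points rescaled to [0, 1]. The function names `classification_margins` and `margin_curve_area` are hypothetical.

```python
import numpy as np


def classification_margins(logits, labels):
    """Per-sample margin: true-class score minus the best competing score.

    Assumption: this is the usual multi-class margin; the abstract does not
    spell out the exact definition or normalization used.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    true_scores = logits[idx, labels]
    # Mask out the true class so the max runs over competing classes only.
    masked = logits.copy()
    masked[idx, labels] = -np.inf
    return true_scores - masked.max(axis=1)


def margin_curve_area(margins):
    """Trapezoidal area under the sorted margins, x-axis rescaled to [0, 1]."""
    m = np.sort(np.asarray(margins, dtype=float))
    if m.size < 2:
        return float(m.sum())
    dx = 1.0 / (m.size - 1)
    return float(np.sum((m[1:] + m[:-1]) * 0.5) * dx)


# Toy example with three samples and three classes (illustrative values only).
logits = [[2.0, 0.0, 1.0],
          [0.0, 3.0, 1.0],
          [1.0, 0.0, 0.5]]
labels = [0, 1, 2]
margins = classification_margins(logits, labels)
area = margin_curve_area(margins)
print(margins, area)
```

A negative margin marks a misclassified training point, so under this definition the data are separated once every margin is positive. Under the assumptions above, a larger area corresponds to a margin distribution shifted toward larger values.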