Abstract
Deep learning neural network models must be large enough to adapt to theirproblem domain, while small enough to avoid overfitting training data duringgradient descent. To balance these competing demands, overprovisioned deeplearning models such as transformers are trained for a single epoch on largedata sets, and hence inefficient with both computing resources and trainingdata. In response to these inefficiencies, we exploit learning theory to deriveOccam Gradient Descent, an algorithm that interleaves adaptive reduction ofmodel size to minimize generalization error, with gradient descent on modelweights to minimize fitting error. In contrast, traditional gradient descentgreedily minimizes fitting error without regard to generalization error. Ouralgorithm simultaneously descends the space of weights and topological size ofany neural network without modification. With respect to loss, compute andmodel size, our experiments show (a) on image classification benchmarks, linearand convolutional neural networks trained with Occam Gradient Descentoutperform traditional gradient descent with or without post-train pruning; (b)on a range of tabular data classification tasks, neural networks trained withOccam Gradient Descent outperform traditional gradient descent, as well asRandom Forests; (c) on natural language transformers, Occam Gradient Descentoutperforms traditional gradient descent.