Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning

Abstract

Multi-layer neural networks have led to remarkable performance on many kinds of benchmark tasks in text, speech and image processing. Nonlinear parameter estimation in hierarchical models is known to be subject to overfitting. One approach to this overfitting and related problems (local minima, collinearity, feature discovery, etc.) is called dropout (Srivastava et al., 2014; Baldi et al., 2016). This method removes hidden units on each update according to a Bernoulli random variable with probability $p$. In this paper we show that dropout is a special case of a more general model, originally published in 1990, called the stochastic delta rule (SDR; Hanson, 1990). SDR parameterizes each weight in the network as a random variable with mean $\mu_{w_{ij}}$ and standard deviation $\sigma_{w_{ij}}$. These random variables are sampled on each forward activation, consequently creating an exponential number of potential networks with shared weights. Both parameters are updated according to prediction error, thus implementing weight noise injections that reflect a local history of prediction error and efficient model averaging. SDR therefore implements a local, gradient-dependent simulated annealing per weight, converging to a Bayes-optimal network. Tests on standard benchmarks (CIFAR) using a modified version of DenseNet show that SDR outperforms standard dropout in error by over 50% and in loss by over 50%. Furthermore, the SDR implementation converges on a solution much faster, reaching a training error of 5 in just 15 epochs with DenseNet-40, compared to 94 epochs for standard DenseNet-40.
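
The sampling-and-update loop described in the abstract can be illustrated with a small sketch on a single linear unit. The abstract does not give the update equations, so the rules used here are assumptions for illustration only: the mean follows the error gradient, and the standard deviation grows with the gradient magnitude and is then shrunk by a decay factor. The names `sdr_step` and `train` and the constants `alpha`, `beta`, and `zeta` are likewise hypothetical, not the paper's DenseNet implementation.

```python
import random

def sdr_step(mu, sigma, x, target, alpha=0.05, beta=0.02, zeta=0.9, rng=random):
    """One illustrative SDR-style update for a linear unit y = sum_i w_i * x_i."""
    # Sample every weight from its own Gaussian for this forward pass.
    w = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    y = sum(wi * xi for wi, xi in zip(w, x))
    err = y - target
    # Gradient of 0.5 * err**2 with respect to each sampled weight.
    grad = [err * xi for xi in x]
    # Assumed rule: the mean takes a plain gradient step.
    new_mu = [m - alpha * g for m, g in zip(mu, grad)]
    # Assumed rule: the std grows with |gradient|, then decays by zeta < 1,
    # so noise shrinks wherever prediction error stays small.
    new_sigma = [zeta * (s + beta * abs(g)) for s, g in zip(sigma, grad)]
    return new_mu, new_sigma, err

def train(steps=2000, seed=0):
    """Fit a noiseless 2-input linear target with the sketch above."""
    rng = random.Random(seed)
    true_w = [2.0, -1.0]               # hypothetical target weights
    mu, sigma = [0.0, 0.0], [0.5, 0.5]
    for _ in range(steps):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        target = sum(t * xi for t, xi in zip(true_w, x))
        mu, sigma, _ = sdr_step(mu, sigma, x, target, rng=rng)
    return mu, sigma

if __name__ == "__main__":
    print(train())
```

In this toy setting the per-weight standard deviations shrink as the error falls, which is the annealing-like behavior the abstract describes. Whether the same constants suit a deep network is not something this sketch addresses.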
