Poly-time universality and limitations of deep learning

Abstract

The goal of this paper is to characterize function distributions that deeplearning can or cannot learn in poly-time. A universality result is proved forSGD-based deep learning and a non-universality result is proved for GD-baseddeep learning; this also gives a separation between SGD-based deep learning andstatistical query algorithms: (1) {\it Deep learning with SGD is efficiently universal.} Any functiondistribution that can be learned from samples in poly-time can also be learnedby a poly-size neural net trained with SGD on a poly-time initialization withpoly-steps, poly-rate and possibly poly-noise. Therefore deep learning provides a universal learning paradigm: it was knownthat the approximation and estimation errors could be controlled with poly-sizeneural nets, using ERM that is NP-hard; this new result shows that theoptimization error can also be controlled with SGD in poly-time. The picturechanges for GD with large enough batches: (2) {\it Result (1) does not hold for GD:} Neural nets of poly-size trainedwith GD (full gradients or large enough batches) on any initialization withpoly-steps, poly-range and at least poly-noise cannot learn any functiondistribution that has super-polynomial {\it cross-predictability,} where thecross-predictability gives a measure of ``average'' function correlation --relations and distinctions to the statistical dimension are discussed. Inparticular, GD with these constraints can learn efficiently monomials of degree$k$ if and only if $k$ is constant. Thus (1) and (2) point to an interesting contrast: SGD is universal even withsome poly-noise while full GD or SQ algorithms are not (e.g., parities).

Quick Read (beta)

loading the full paper ...