Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think

Abstract

We perform an empirical study of the behaviour of deep networks when fully linearizing some of their feature channels through a sparsity prior on the overall number of nonlinear units in the network. In experiments on image classification and machine translation tasks, we investigate how much we can simplify the network function towards linearity before performance collapses. First, we observe a significant performance gap when reducing nonlinearity in the network function early on as opposed to late in training, in line with recent observations on the time-evolution of the data-dependent NTK. Second, we find that after training, we are able to linearize a significant number of nonlinear units while maintaining a high performance, indicating that much of a network's expressivity remains unused but helps gradient descent in early stages of training. To characterize the depth of the resulting partially linearized network, we introduce a measure called average path length, representing the average number of active nonlinearities encountered along a path in the network graph. Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core-networks of near constant effective depth and width, which in turn depend on task difficulty.
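
The average path length described above can be illustrated with a small sketch. This is one possible reading, not the paper's definition: it assumes a plain feed-forward stack with no skip connections, where a path picks one unit per layer and all paths are weighted equally. The names `masks`, `average_path_length`, and `average_path_length_bruteforce` are hypothetical and introduced only for illustration.

```python
from itertools import product


def average_path_length(masks):
    """Expected number of nonlinear units on a uniformly random path.

    masks: one list of booleans per layer, True where the unit is still
    nonlinear and False where it has been linearized (assumed encoding).
    With one unit chosen per layer, the expectation is the sum of the
    per-layer fractions of nonlinear units.
    """
    return sum(sum(layer) / len(layer) for layer in masks)


def average_path_length_bruteforce(masks):
    """Same quantity, computed by enumerating every path explicitly."""
    total = 0
    count = 0
    for path in product(*masks):
        total += sum(path)  # nonlinear units encountered on this path
        count += 1
    return total / count


# Toy network: 3 layers, partially linearized (made-up values).
masks = [
    [True, True, False, False],  # half of the units remain nonlinear
    [True, False],               # half of the units remain nonlinear
    [False, False, False],       # fully linearized layer
]

apl = average_path_length(masks)
print(apl)
```

Under these assumptions, a fully linearized layer contributes nothing to the measure, so the value can be much smaller than the nominal layer count. Architectures with residual connections would need paths of varying length, which this sketch does not cover.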
