Complexity of Linear Regions in Deep Networks

  • 2019-06-11 17:12:14
  • Boris Hanin, David Rolnick
  • 0

Abstract

It is well-known that the expressivity of a neural network depends on itsarchitecture, with deeper networks expressing more complex functions. In thecase of networks that compute piecewise linear functions, such as those withReLU activation, the number of distinct linear regions is a natural measure ofexpressivity. It is possible to construct networks with merely a single region,or for which the number of linear regions grows exponentially with depth; it isnot clear where within this range most networks fall in practice, either beforeor after training. In this paper, we provide a mathematical framework to countthe number of linear regions of a piecewise linear network and measure thevolume of the boundaries between these regions. In particular, we prove thatfor networks at initialization, the average number of regions along anyone-dimensional subspace grows linearly in the total number of neurons, farbelow the exponential upper bound. We also find that the average distance tothe nearest region boundary at initialization scales like the inverse of thenumber of neurons. Our theory suggests that, even after training, the number oflinear regions is far below exponential, an intuition that matches ourempirical observations. We conclude that the practical expressivity of neuralnetworks is likely far below that of the theoretical maximum, and that this gapcan be quantified.

 

Quick Read (beta)

Complexity of Linear Regions in Deep Networks

Boris Hanin    David Rolnick
Abstract

It is well-known that the expressivity of a neural network depends on its architecture, with deeper networks expressing more complex functions. In the case of networks that compute piecewise linear functions, such as those with ReLU activation, the number of distinct linear regions is a natural measure of expressivity. It is possible to construct networks with merely a single region, or for which the number of linear regions grows exponentially with depth; it is not clear where within this range most networks fall in practice, either before or after training. In this paper, we provide a mathematical framework to count the number of linear regions of a piecewise linear network and measure the volume of the boundaries between these regions. In particular, we prove that for networks at initialization, the average number of regions along any one-dimensional subspace grows linearly in the total number of neurons, far below the exponential upper bound. We also find that the average distance to the nearest region boundary at initialization scales like the inverse of the number of neurons. Our theory suggests that, even after training, the number of linear regions is far below exponential, an intuition that matches our empirical observations. We conclude that the practical expressivity of neural networks is likely far below that of the theoretical maximum, and that this gap can be quantified.

ReLU, linear region, expressivity, learnability, initialization, random network

Figure 1: How many linear regions? This figure shows a two-dimensional slice through the 784-dimensional input space of vectorized MNIST, as represented by a fully-connected ReLU network with three hidden layers of width 64 each. Colors denote different linear regions of the piecewise linear network.

1 Introduction

A growing field of theory has sought to explain the broad success of deep neural networks via a mathematical characterization of the ability of these networks to approximate different functions of input data. Many such works consider the expressivity of neural networks, showing that certain functions are more efficiently expressible by deep architectures than by shallow ones (e.g. Bianchini & Scarselli (2014); Montufar et al. (2014); Telgarsky (2015); Lin et al. (2017); Rolnick & Tegmark (2018)). It has, however, also been noted that many expressible functions are not efficiently learnable, at least by gradient descent (Shalev-Shwartz et al., 2018). More generally, the typical behavior of a network used in practice, the practical expressivity, may be very different from what is theoretically attainable. To adequately explain the power of deep learning, it is necessary to consider networks with parameters as they will naturally occur before, during, and after training.

Networks with a piecewise linear activation (e.g. ReLU, hard tanh) compute piecewise linear functions for which input space is divided into pieces, with the network computing a single linear function on each piece (see Figures 1-4). Figure 2 shows how the complexity of these pieces, which we refer to as linear regions, changes in a deep ReLU net with two-dimensional inputs. Each neuron in the first layer splits the input space into two pieces along a hyperplane, fitting a different linear function to each of the pieces. Subsequent layers split the regions of the preceding layers. The local density of linear regions serves as a convenient proxy for the local complexity or smoothness of the network, with the ability to interpolate a complex data distribution seeming to require fitting many relatively small regions. The topic of counting linear regions is taken up by a number of authors (Telgarsky, 2015; Montufar et al., 2014; Serra et al., 2018; Raghu et al., 2017).

A worst case estimate is that every neuron in each new layer splits each of the regions present at the previous layer, giving a number of regions exponential in the depth. Indeed this is possible, as examined extensively e.g. in Montufar et al. (2014). An example of Telgarsky (2015) shows that a sawtooth function with 2n teeth can be expressed exactly using only 3n+4 neurons, as shown in Figure 3. However, even slightly perturbing this network (by adding noise to the weights and biases) ruins this beautiful structure and severely reduces the number of linear pieces, raising the question of whether typical neural networks actually achieve the theoretical bounds for numbers of linear regions.

Figure 2: Evolution of linear regions within a ReLU network for 2-dimensional input. Each neuron in the first layer defines a linear boundary that partitions the input space into two regions. Neurons in the second layer combine and split these linear boundaries into higher level patterns of regions, and so on. Ultimately, the input space is partitioned into a number of regions, on each of which the neural network is given by a (different) linear function. During training, both the partition into regions and the linear functions on them are learned.

Figure 1 also invites measures of complexity for piecewise linear networks beyond region counting. The boundary between two linear regions can be straight or can be bent in complex ways, for example, suggesting the volume of the boundary between linear regions as complexity measure for the resulting partition of input space. In the 2D example of Figure 1, this corresponds to computing perimeters of the linear pieces. As we detail below, this measure has another natural advantage: the volume of the boundary controls the typical distance from a random input to the boundary of its linear region (see §2.2). This measures the stability of the function computed by the network, and it is intuitively related to robustness under adversarial perturbation.

Figure 3: The sawtooth function on the left with 2n teeth can be expressed succinctly by a ReLU network with only 3n+4 neurons (construction from Telgarsky (2015)). However, slight perturbation of the weights and biases of the network (by Gaussian noise with standard deviation 0.1) greatly simplifies the linear regions captured by the network.

Our Contributions. In this paper, we provide mathematical tools for analyzing the complexity of linear regions of a network with piecewise linear activations (such as ReLU) before, during, and after training. Our main contributions are as follows:

  • For networks at initialization, the total surface area of the boundary between linear regions scales as the number of neurons times the number of breakpoints of the activation function. This is our main result, from which several corollaries follow (see Theorem 3, Corollary 4, and the discussion in §2).

  • In particular, for any line segment through input space, the average number of regions intersecting it is linear in the number of neurons, far below the exponential number of regions that is theoretically attainable.

  • Theorem 3 also allows us to conclude that, at initialization, the average distance from a sample point to the nearest region boundary is bounded below by a constant times the reciprocal of the number of neurons (see Corollary 5).

  • We find empirically that both the number of regions and the distance to the nearest region boundary stay roughly constant during training and in particular are far from their theoretical maxima. That this should be the case is strongly suggested by Theorem 3, though not a direct consequence of it.

Overall, our results stress that practical expressivity lags significantly behind theoretical expressivity. Moreover, both our theoretical and empirical findings suggest that for certain measures of complexity, trained deep networks are remarkably similar to the same networks at initialization.

In the next section, we informally state our theoretical and empirical results and explore the underlying intuitions. Detailed descriptions of our experiments are provided in §3. The precise theorem statements for ReLU networks can be found in §5. The exact formulations for general piecewise linear networks are in Appendix A, with proofs in the rest of the Supplementary Material. In particular, Appendix B contains intuition for how our proofs are shaped, while details are completed in §C-D.

Figure 4: Graph of function computed by a ReLU net with input and output dimension 1 at initialization. The weights of the network are He normal (i.i.d. normal with variance = 2/fan-in) and the biases are i.i.d. normal with variance 10-6.

2 Informal Overview of Results

This section gives an informal introduction to our results. We begin in §2.1 by describing the case of networks with input dimension 1. In §2.2, we consider networks with higher input dimension. For simplicity, we focus throughout this section on fully connected ReLU networks. We emphasize, however, that our results apply to any piecewise linear activation. Moreover, the upper bounds we present in Theorems 1, 2, and 3 (and hence in Corollaries 4 and 5) can also be generalized to hold for feed-forward networks with arbitrary connectivity, though we do not go into details in this work, for the sake of clarity of exposition.

2.1 Number of Regions in 1D

Consider the simple case of a ReLU net 𝒩 with input and output dimensions equal to 1. Such a network computes a piecewise linear function (see Figure 4), and we are interested in understanding both at initialization and during training the number of distinct linear regions. There is a simple universal upper bound:

max #{regions}2#neurons, (1)

where the maximum is over all settings of weight and biases. This bound depends on the architecture of 𝒩 only via the number of neurons. For more refined upper bounds which take into account the widths of the layers, see Theorem 1 in Raghu et al. (2017) and Theorem 1 in Serra et al. (2018).

The constructions in Montufar et al. (2014); Telgarsky (2015); Raghu et al. (2017); Serra et al. (2018) indicate that the bound in (1) is very far from sharp for shallow and wide networks but that exponential growth in the number of regions can be achieved in deep, skinny networks for very special choices of weights and biases. This is a manifestation of the expressive power of depth, the idea that repeated compositions allow deep networks to capture complex hierarchical relations more efficiently per parameter than their shallow cousins. However, there is no non-trivial lower bound for the number of linear regions:

min #{regions}=1,𝒩.

The minimum is attained by setting all weights and biases to 0. This raises the question of the behavior for the average number of regions when the weights and biases are chosen at random (e.g. at initialization). Intuitively, configurations of weights and biases that come close to saturating the exponential upper bound (1) are numerically unstable in the sense that a small random perturbation of the weights and biases drastically reduces the number of linear regions (see Figure 3 for an illustration). In this direction, we prove a somewhat surprising answer to the question of how many regions 𝒩 has at initialization. We state the result for ReLU but note that it holds for any piecewise linear, continuous activation function (see Theorems 3 and 6).

Theorem 1 (informal).

Let N be a network with piecewise linear activation with input and output dimensions of N both equal 1. Suppose the weights and biases are randomly initialized so that for each neuron z, its pre-activation z(x) has bounded mean gradient

𝔼[z(x)]C,some C>0. (2)

This holds, for example, for ReLU networks initialized with independent, zero-centered weights with variance 2/fan-in. Then, for each subset IR of inputs, the average number of linear regions inside I is proportional to the number of neurons times the length of I

𝔼[#{regionsinI}]|I|T#{neurons},

where T is the number of breakpoints in the non-linearity of N (for ReLU nets, T=1). The same result holds when computing the number of linear regions along any fixed 1-dimensional curve in a high-dimensional input space.

This theorem implies that the average number of regions along a one-dimensional curve in input space is proportional to the number of neurons, but independent of the arrangement of those neurons. In particular, a shallow network and a deep network will have the same complexity, by this measure, as long as they have the same total number of neurons. Of course, as |I| grows, the bounds in Theorem 1 become less sharp. We plan to extend our results to obtain bounds on the total number of regions on all of in the future. In particular, we believe that at initialization the mean total number of linear regions 𝒩 is proportional to the number of neurons (this is borne out in Figure 5, which computes the total number of regions on an infinite line).

Theorem 1 defies the common intuition that, on average, each layer in 𝒩 multiplies the number of regions formed up to the previous layer by a constant larger than one. This would imply that the average number of regions is exponential in the depth. To provide intuition for why this is not true for random weights and biases, consider the effect of each neuron separately. Suppose the pre-activation z(x) of a neuron z satisfies |z(x)|=Θ(1), a hallmark of any reasonable initialization. Then, over a compact set of inputs, the piecewise linear function xz(x) cannot be highly oscillatory over a large portion of the range of z. Thus, if the bias bz is not too concentrated on any interval, we expect the equation z(x)=bz to have O(1) solutions. On average, then, we expect that each neuron adds a constant number of new linear regions. Thus, the average total number of regions should scale roughly as the number of neurons.

Theorem 1 follows from a general result, Theorem 3, that holds for essentially any non-degenerate distribution of weights and biases and with any input dimension. If z(x) and the bias distribution ρbz are well-behaved, then throughout training, Theorem 3 suggests the number of linear regions along a 1-dimensional curve in input space scales like the number of neurons in 𝒩. Figures 5-6 show experiments that give empirical verification of this heuristic.

Figure 5: We here show how the number of regions along 1D lines in input space changes during training. In accordance with Theorem 3, we scale the number of regions by the number of neurons. Plots show (a) early training, up through 0.5 epochs, and (b) later training, up through 20 epochs. Note that for all networks, number of regions is a fixed constant times the number of neurons at initialization, as predicted, and that the number decreases (slightly) early in training before rebounding. [n1,n2,n3] in the legend corresponds to an architecture with layer widths 784 (input),n1,n2,n3,10 (output).

2.2 Higher-Dimensional Regions

For networks with input dimension exceeding 1, there are several ways to generalize counting linear regions. A unit-matching heuristic applied to Theorem 1 suggests

#{regions}=#{neurons}nin,nin=input dim.

Proving this statement is work in progress by the authors. Instead, we consider here a natural and, in our view, equally important generalization. Namely, for a bounded Knin, we consider the (nin-1)-dimensional volume density

volnin-1(𝒩K)/volnin(K), (3)

where

𝒩={x|𝒩(x) is not continuous at x} (4)

is the boundary of the linear regions for 𝒩. When nin=1,

vol0(𝒩K)+1=#{regionsinK},

and hence the volume density (3) truly generalizes to higher input dimension of the number of regions. One reason for studying the volume density (3) is that it gives bounds from below for distance(x,𝒩), which in turn provides insight into the nature of the computation performed by 𝒩. Indeed, the exact formula

distance(x,𝒩)=minneuronsz{|z(x)-bz|/z(x)},

shows that distance(x,𝒩) measures the sensitivity over neurons at a given input x. In this formula, z(x) denotes the pre-activation for a neuron z and bz is its bias, so that ReLU(z(x)-bz) is the post-activation. Moreover, the distance from a typical point to 𝒩 gives a heuristic lower bound for the typical distance to an adversarial example: two inputs closer than the typical distance to a linear region boundary likely fall into the same linear region, and hence are unlikely to be classified differently. Our next result generalizes Theorem 1.

Theorem 2 (informal).

Let N be a network with a piecewise linear activation, input dimension nin and output dimension 1. Suppose its weights and biases are randomly initialized as in (2). Then, for KRdin bounded, the average volume of the linear region boundaries in K satisfies:

𝔼[volnin-1(𝒩K)volnin(K)]T#{neurons},

where T is the number of breakpoints in the non-linearity of N (for ReLU nets, T=1). Moreover, if x[0,1]nin is uniformly distributed, then the average, over both x and the weights/biases of N, distance from x to BN satisfies

𝔼[distance(x,𝒩)]C(#{neurons})-1,C>0.

Experimentally, distance(x,𝒩) remains comparable to (#{neurons})-1 throughout training (see Figure 6).

3 Experiments

We empirically verified our theorems and further examined how linear regions of a network change during training. All experiments below were performed with fully-connected networks, initialized with He normal weights (i.i.d. with variance 2/fan-in) and biases drawn i.i.d. normal with variance 10-6 (to prevent collapse of regions at initialization, which occurs when all biases are uniquely zero). Training was performed on the vectorized MNIST (input dimension 784) using the Adam optimizer at learning rate 10-3. All networks attain test accuracy in the range 95-98%.

Figure 6: We here consider the average distance to the nearest boundary, as evaluated over 10000 randomly selected sample points. In (a) we show that this distance is essentially bounded between 0.4/#{neurons} and 1.5/#{neurons}. Accordingly, in the next plot, we normalize the distance to the nearest boundary by dividing by the number of neurons. We plot this quantity against (b) epoch and (c) test accuracy. Observe that, in keeping with the findings of Figure 5, the distance to the nearest boundary first increases quickly (as the number of regions decreases), then rebounds more slowly as the network completes training. [n1,n2,n3] in the legend corresponds to an architecture with layer widths 784 (input),n1,n2,n3,10 (output).

3.1 Number of Regions Along a Line

We calculated the number of regions along lines through the origin and and a random selected training example in input space. For each setting of weights and biases within the network during training, the number of regions along each line is calculated exactly by building up the network one layer at a time and calculating how each region is split by the next layer of neurons. Figure 5 represents the average over 5 independent runs, from each of which we sample 100 lines; variance across the different runs is not significant.

Figure 5 plots the average number of regions along a line, divided by the number of neurons in the network, as a function of epoch during training. We make several observations:

  1. 1.

    As predicted by Theorem 3, all networks start out with the number of regions along a line equal to a constant times the number of neurons in the network (the constant in fact appears very close to 1 in this case).

  2. 2.

    Throughout training, the number of regions does not deviate significantly from the number of neurons in the network, staying within a small constant of the value at initialization, in keeping with our intuitive understanding of Theorem 3 described informally around Theorem 1 above.

  3. 3.

    The number of regions actually decreases during the initial part of training, then increases again. We explore this behavior further in other experiments below.

3.2 Distance to the Nearest Region Boundary

We calculated the average distance to the nearest boundary for 10000 randomly selected input points, for various networks throughout training. Points were selected randomly from a normal distribution with mean and variance matching the componentwise mean and variance of MNIST training data. Results were averaged over 12 independent runs, but variance across runs is not significant. Rerunning these experiments with sample points selected randomly from (i) the training data or (ii) the test data yielded similar results to random sample points.

In keeping with our results in the preceding experiment, the distance to the nearest boundary first increases then decreases during training. As predicted by Theorem 2, we find that for all networks, the distance to the nearest boundary is well-predicted by 1/#{neurons}. Throughout training, we find that it approximately varies between the curves 0.4/#{neurons} and 1.5/#{neurons} (Figure 6(a)). At initialization, as we predict, all networks have the same value for the product of number of neurons and distance to the nearest region boundary (Figure 6(b)); these products then diverge (slightly) for different architectures, first increasing rapidly and then decreasing more slowly.

We find Figure 6(c) fascinating, though we do not completely understand it. It plots the product of number of neurons and distance to the nearest region boundary against the test accuracy. It suggests two phases of training: first regions expand, then they contract. This lines up with observations made in Arpit et al. (2017) that neural networks “learn patterns first” on which generalization is simple and then refine the fit to encompass memorization of individual samples. A generalization phase would suggest that regions are growing, while memorization would suggest smaller regions are fit to individual data points. This is, however, speculation and more experimental (and theoretical) exploration will be required to confirm or disprove this intuition.

Figure 7: Distribution of log distances from random sample points to the nearest region boundary for a network of depth 4 and width 16, at initialization and after 1 and 20 epochs of training on MNIST.

We found it instructive to consider the full distribution of distances from sample points to their nearest boundaries, rather than just the average. For a single network (depth 4, width 16), Figure 7 indicates that this distribution does not significantly change during training, although there appears to be a slight skew towards larger regions, in agreement with the findings in Novak et al. (2018). The histogram shows log-distances. Hence, distance to the nearest region boundary varies over many orders of magnitude. This is consistent with Figures 1 and 4, which lend credence to the intuition that small distances to the nearest region boundary are explained by the presence of many small regions. According to Theorem 3, this should correlate with a combination of regions in input space at which some neurons have a large gradient and neurons with highly peaked biases distributions. We hope to return to this in future work.

3.3 Regions Within a 2D Plane

We visualized the regions of a network through training. Specifically, following experiments in Novak et al. (2018), we plotted regions within a plane in the 784-dimensional input space (Figure 8) through three data points with different labels (0, 1, and 2, in our case) inside a square centered at the circumcenter of the three examples. The network shown has depth 3 and width 64. We observe that, as expected from our other plots, the regions expand initially during training and then contract again. We expect the number of regions within a 2-dimensional subspace to be on the order of the square of the number of neurons – that is, (643)24×104, which we indeed find.

Our approach for calculating regions is simple. We start with a single region (in this case, the square), and subdivide it by adding neurons to the network one by one. For each new neuron, we calculate the linear function it defines on each region, and determine whether that region is split into two. This approach terminates within a reasonable amount of time precisely because our theorem holds: there are relatively few regions. Note that we exactly determine all regions within the given square by calculating all region boundaries; thus our counts are exact and do not miss any small regions, as might occur if we merely estimated regions by sampling points from input space.

Figure 8: Here we show the linear regions that intersect a 2D plane through input space for a network of depth 3 and width 64 trained on MNIST. Black dots indicate the positions of the three MNIST training examples defining the plane. Note that we obtain qualitatively different pictures from Novak et al. (2018), which may result partially from our using ReLU activation instead of ReLU6.

4 Related Work

There are a number of works that touch on the themes of this article: (i) the expressivity of depth; (ii) counting the number of regions in networks with piecewise linear activations; (iii) the behavior of linear regions through training; and (iv) the difference between expressivity and learnability. Related to (i), we refer the reader to Eldan & Shamir (2016); Telgarsky (2016) for examples of functions that can be efficiently represented by deep but not shallow ReLU nets. Next, still related to (i), for uniform approximation over classes of functions, again using deep ReLU nets, see Yarotsky (2017); Rolnick & Tegmark (2018); Yarotsky (2018); Petersen & Voigtlaender (2018). For interesting results on (ii) about counting the maximal possible number of linear regions in networks with piecewise linear activations see Bianchini & Scarselli (2014); Montufar et al. (2014); Poole et al. (2016); Arora et al. (2018); Raghu et al. (2017). Next, in the vein of (iii), for both a theoretical and empirical perspective on the number of regions computed by deep networks and specifically how the regions change during training, see Poole et al. (2016); Novak et al. (2018). In the direction of (iv), we refer the reader to Shalev-Shwartz et al. (2018); Hanin & Rolnick (2018); Hanin (2018). Finally, for general insights into learnability and expressivity in deep vs. shallow networks see Mhaskar & Poggio (2016); Mhaskar et al. (2016); Zhang et al. (2017); Lin et al. (2017); Poggio et al. (2017); Neyshabur et al. (2017).

5 Formal Statement of Results

To state our results precisely, we fix some notation. Let d,nin,n1,,nd1 and consider a depth d fully connected ReLU net 𝒩 with input dimension nin, output dimension 1, and hidden layer widths nj,j=1,,d-1. As explained in the introduction, a generic configuration of its weights and biases partitions the input space nin into a union of polytopes Pj with disjoint interiors. Restricted to each Pj, 𝒩 computes a linear function.

Our main mathematical result, Theorem 3, concerns the set 𝒩 of points xnin at which the gradient 𝒩 is discontinuous at x (see (4)). For each k=1,,nin, we define

𝒩,k=the ``(nin-k)–dimensional piece” of 𝒩. (5)

More precisely, we set 𝒩,0:= and recursively define 𝒩,k to be the set of points x𝒩\{𝒩,0𝒩,k-1} so that in a neighborhood of x the set 𝒩\{𝒩,0𝒩,k-1} coincides with a co-dimension k hyperplane.

For example, when nin=2, the linear regions Pj are polygons, the set 𝒩,1 is the union of the open line segments making up the boundaries of the Pj, and 𝒩,2 is the collection of vertices of the Pj. Theorem 3 provides a convenient formula for the average of the (nin-k)-dimensional volume of 𝒩,k inside any bounded, measurable set Knin. To state the result, for every neuron z in 𝒩 we will write

z(x):=pre-activation at z,(z)=layer index of z (6)

and bz:=bias at z. Thus, for a given input xn0, the post-activation of z is

Z(x):=ReLU(z(x))=max{0,z(x)-bz}. (7)

Theorem 3 holds under the following assumption on the distribution of weights and biases:

  1. A1:

    The conditional distribution of any collection of biases bz1,,bzk, given all the other weights and biases, has a density ρbz1,,bzk(b1,,bk) with respect to Lebesgue measure on k.

  2. A2:

    The joint distribution of all the weights has a density with respect to Lebesgue measure on #weights.

These assumptions hold in particular when the weights and biases of 𝒩 are independent with marginal distributions that have a density relative to Lebesgue measure on (i.e. at initialization). They hold much more generally, however, and can intuitively be viewed as a non-degeneracy assumption on the behavior of the weights and biases of 𝒩. Specifically, they are used in Proposition 10 to ensure that the set 𝒩,k consists of inputs where exactly k neurons turn off/on. Assumption (A1) also allows us, in Proposition 11, to apply the co-area formula (29) to compute the expect volume of the set of inputs where a given collection of neurons turn on/off. Our main result is the following.

Theorem 3.

Suppose N is a feed-forward ReLU net with input dimension n0, output dimension 1, and random weights/biases. Assume that the distribution of weights/biases satisfies Assumptions A1 and A2 above. Then, with the notation (6), for any bounded measurable set KRnin and any k=1,,nin, the average (nin-k)- dimensional volume E[volnin-k(BN,kK)] of BN,k inside K is

𝔼[volnin-k(𝒩,kK)] (8)

of BN,k inside K is, in the notation (6),

distinctneuronsz1,,zkin𝒩K𝔼[Yz1,,zk(x)]𝑑x,

where Yz1,,zk(x) is

Jz1,,zk(x)ρbz1,,bzk(z1(x),,zk(x))

times the indicator function of the event that zj is good at x for each j=1,,k. Here, Jz1,,zk is the k×nin Jacobian of the map x(z1(x),,zk(x)),

Jz1,,zk(x):=det(Jz1,,zk(x)(Jz1,,zk(x))T)1/2,

the function ρbz1,,bzk is the density of the joint distribution of the biases bz1,,bzk, and we say a neuron z is good at x if there exists a path of neurons from z to the output in the computational graph of N so that each neuron along this path is open at x).

To evaluate the expression in (8) requires information on the distribution of gradients z(x), the pre-activations z(x), and the biases bz. Exact information about these quantities is available at initialization (Hanin, 2018; Hanin & Rolnick, 2018; Hanin & Nica, 2018), yielding the following Corollary.

Corollary 4.

With the notation and assumptions of Theorem 3, suppose the weights are independent are drawn from a fixed probability measure μ on R that is symmetric around 0 and then rescaled to have Var[weights]= 2/fan-in. Fix k{1,,nin}. Then there exists C>0 for which

𝔼[volnin-k(𝒩,kK)]volnin(K) (9)
(#{neurons}k)(CgradCbias)k,

where

Cbias=supzsupbρbz(b)

and

Cgrad=supzsupxnin𝔼[z(x)2k]1/kCeCj=1d1nj

where C>0 depends only on μ but not on the architecture of N and nj is the width of the jth hidden layer. Moreover, we also have similar lower bounds

(#{neurons}k)cbiask𝔼[volnin-k(𝒩,kK)]volnin(K) (10)

where

cbias=inf|b|ηρbz(b),

and

η=(supxKx2nin+j=1dσbj2)eCj=1d1nj,

with C>0 depending only on the distribution μ of the weights in N.

We prove Corollary 7 in Appendix D. Let us state one final corollary of Theorem 3

Corollary 5.

Suppose N is as in Theorem 3 and satisfies the hypothesis (14) in Corollary 7. Then, for any compact set KRnin let x be a uniform point in K. There exists c>0 independent of K so that

𝔼[distance(x,𝒩)]cCbiasCgrad#{neurons}.

We prove Corollary 8 in §E. The basic idea is simple. For every ϵ>0, we have

𝔼[distance(x,𝒩)]ϵ(distance(x,𝒩)>ϵ),

with the probability on the right hand side scaling like

1-volnin(Tϵ(𝒩)K)/volnin(K),

where Tϵ(𝒩) is the tube of radius ϵ around 𝒩. We expect that its volume like ϵvolnin-1(𝒩). Taking ε=c/#{neurons} yields the conclusion of Corollary 8.

6 Conclusions and Further Work

The question of why depth is powerful has been a persistent problem for deep learning theory, and one that recently has been answered by works giving enhanced expressivity as the ultimate explanation. However, our results suggest that such explanations may be misleading. While we do not speak to all notions of expressivity in this paper, we have both theoretically and empirically evaluated one common measure: the linear regions in the partition of input space defined by a network with piecewise linear activations. We found that the average size of the boundary of these linear regions depends only on the number of neurons and not on the network depth – both at initialization and during training. This strongly suggests that deeper networks do not learn more complex functions than shallow networks. We plan to test this interpretation further in future work – for example, with experiments on more complex tasks, as well as by investigating higher order statistics, such as the variance.

We do not propose a replacement theory for the success of deep learning; however, prior work has already hinted at how such a theory might proceed. Notably, Ba & Caruana (2014) show that, once deep networks are trained to perform a task successfully, their behavior can often be replicated by shallow networks, suggesting that the advantages of depth may be linked to easier learning.

References

  • Arora et al. (2018) Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. In ICLR, 2018.
  • Arpit et al. (2017) Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In ICML, 2017.
  • Ba & Caruana (2014) Ba, J. and Caruana, R. Do deep nets really need to be deep? In NeurIPS, pp. 2654–2662, 2014.
  • Bianchini & Scarselli (2014) Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.
  • Eldan & Shamir (2016) Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In COLT, pp. 907–940, 2016.
  • Hanin (2018) Hanin, B. Which neural net architectures give rise to exploding and vanishing gradients? In NeurIPS, 2018.
  • Hanin & Nica (2018) Hanin, B. and Nica, M. Products of many large random matrices and gradients in deep neural networks. Preprint arXiv:1812.05994, 2018.
  • Hanin & Rolnick (2018) Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. In NeurIPS, 2018.
  • Lin et al. (2017) Lin, H. W., Tegmark, M., and Rolnick, D. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
  • Mhaskar et al. (2016) Mhaskar, H., Liao, Q., and Poggio, T. Learning functions: when is deep better than shallow. Preprint arXiv:1603.00988, 2016.
  • Mhaskar & Poggio (2016) Mhaskar, H. N. and Poggio, T. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
  • Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In NeurIPS, pp. 2924–2932, 2014.
  • Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In NeurIPS, pp. 5947–5956, 2017.
  • Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. In ICLR, 2018.
  • Petersen & Voigtlaender (2018) Petersen, P. and Voigtlaender, F. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
  • Poggio et al. (2017) Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. Why and when can deep – but not shallow – networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.
  • Poole et al. (2016) Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In NeurIPS, pp. 3360–3368, 2016.
  • Raghu et al. (2017) Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Dickstein, J. S. On the expressive power of deep neural networks. In ICML, pp. 2847–2854, 2017.
  • Rolnick & Tegmark (2018) Rolnick, D. and Tegmark, M. The power of deeper networks for expressing natural functions. In ICLR, 2018.
  • Serra et al. (2018) Serra, T., Tjandraatmadja, C., and Ramalingam, S. Bounding and counting linear regions of deep neural networks. In ICML, 2018.
  • Shalev-Shwartz et al. (2018) Shalev-Shwartz, S., Shamir, O., and Shammah, S. Failures of gradient-based deep learning. In ICML, 2018.
  • Telgarsky (2015) Telgarsky, M. Representation benefits of deep feedforward networks. Preprint arXiv:1509.08101, 2015.
  • Telgarsky (2016) Telgarsky, M. Benefits of depth in neural networks. In COLT, 2016.
  • Yarotsky (2017) Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • Yarotsky (2018) Yarotsky, D. Optimal approximation of continuous functions by very deep ReLU networks. In COLT, 2018.
  • Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Appendix A Formal Statement of Results for General Piecewise Linear Activations

In §5, we stated our results in the case of ReLU activation, and now frame these results for a general piecewise linear non-linearity. We fix some notation. Let ϕ: be a continuous piecewise linear function with T breakpoints ξ0=-<ξ1<ξ2<<ξT<ξT+1=. That is, there exist pj,qj so that

t[ξj,ξj+1]ϕ(t)=qjt+pj,qjqj+1. (11)

The analog of Theorem 3 for general ϕ is the following.

Theorem 6.

Let ϕ:RR be a continuous piecewise linear function with T breakpoints ξ1<<ξT as in (11). Suppose N is a fully connected network with input dimension nin, output dimension 1, random weights and biases satisfying A1 and A2 above, and non-linearity ϕ.

Let Jz1,,zk be the k×nin Jacobian of the map x(z1(x),,zk(x)),

Jz1,,zk(x):=det(Jz1,,zk(x)(Jz1,,zk(x))T)1/2,

and write ρbz1,,bzk for the density of the joint distribution of the biases bz1,,bzk. We say a neuron z is good at x if there exists a path of neurons from z to the output in the computational graph of N so that each neuron z^ along this path is open at x (i.e. ϕ(z^(x)-bz^)0).

Then, for any bounded, measurable set KRnin and any k=1,,nin, the average (nin-k)–dimensional volume

𝔼[volnin-k(𝒩,kK)]

of BN,k inside K is, in the notation of (6),

distinctneuronsz1,,zkin𝒩i1,,ik=1TK𝔼[Yz1,,zk(ξi1,,ξik)(x)]𝑑x, (12)

where Yz1,,zk(ξi1,,ξik)(x) equals

Jz1,,zk(x)ρbz1,,bzk(z1(x)-ξi1,,zk(x)-ξik) (13)

multiplied by the indicator function of the event that zj is good at x for every j.

Note that if in the definition (11) of ϕ we have that the possible values ϕ(t){q0,,qT} do not include 0, then we may ignore the event that zj are good at x in the definition of Yz1,,zk(ξi1,,ξik).

Corollary 7.

With the notation and assumptions of Theorem 6, suppose in addition that the weights and biases are independent. Fix k{1,,nin} and suppose that for every collection of distinct neurons z1,,zk, the average magnitude of the product of gradients is uniformly bounded:

supneuronsz1,,zkinputsx𝔼[j=1kzj(x)]Cgradk. (14)

Then we have the following upper bounds

𝔼[volnin-k(𝒩,kK)]volnin(K) (15)
(#{neurons}k)(T2CgradCbias)k,

where T is the number of breakpoints in the non-linearity ϕ of N (see (11)) and

Cbias=supzsupbρbz(b).

We prove Corollary 7 in §D and state a final corollary of Theorem 3:

Corollary 8.

Suppose N is as in Theorem 3 and satisfies the hypothesis (14) in Corollary 7 with constants Cbias,Cgrad. Then, for any compact set KRnin let x be a uniform point in K. There exists c>0 independent of K so that

𝔼[distance(x,𝒩)]cTCbiasCgrad#{neurons},

where, as before, T is the number of breakpoints in the non-linearity ϕ of N.

We prove Corollary 8 in §E. The basic idea is simple. For every ϵ>0, we have

𝔼[distance(x,𝒩)]ϵ(distance(x,𝒩)>ϵ),

with the probability on the right hand side scaling like

1-volnin(Tϵ(𝒩)K)/volnin(K),

where Tϵ(𝒩) is the tube of radius ϵ around 𝒩. We expect that its volume like ϵvolnin-1(𝒩). Taking ε=c/#{neurons} yields the conclusion of Corollary 8.

Appendix B Outline of Proof of Theorem 6

The purpose of this section is to give an intuitive explanation of the proof of Theorem 3. We fix a non-linearity ϕ: with breakpoints ξ1<<ξT (as in (11)) and consider a fully connected network 𝒩 with input dimension nin1, output dimension 1, and non-linearity ϕ. For each neuron z in 𝒩, we write

(z):=layer index of z (16)

and set

Sz:={xnin|z(x)-bz{ξ1,,ξT}}. (17)

We further

S~z:=Sz𝒪, (18)

where

𝒪:={xnin|j=1,,d neuron z with (z)=j s.t. ϕ(z(x)-bz)0}.

Intuitively, the set Sz is the collection of inputs for which the neuron z turns from on to off. In contrast, the set 𝒪 is the collection of inputs xnin for which 𝒩 is open in the sense that there is a path from the input to the output of 𝒩 so that all neurons along this path compute are not constant in a neighborhood x. Thus, S~z is the set of inputs at which neuron z switches between its linear regions and at which the output of neuron z actually affects the function computed by 𝒩.

We remark here that 𝒪= if in the non-linearity ϕ there are no linear pieces at which the slopes on ϕ equals 0 (i.e. qj0 for all j in the definition (11) of ϕ). If, for example, ϕ is ReLU, then 𝒪 need not be empty.

The overall proof of Theorem 3 can be divided into several steps. The first gives the following representation of 𝒩.

Proposition 9.

Under Assumptions A1 and A2 of Theorem 3, we have, with probability 1,

𝒩=neuronszS~z.

The precise proof of Proposition 9 can be found in §C.1 below. The basic idea is that if for all y near a fixed input xnin, none of the pre-activations z(y)-bz cross the boundary of a linear region for ϕ, then x𝒩. Thus, 𝒩zSz. Moreover, if a neuron z satisfies z(x)-bz=Si for some i but there are no open paths from z to the output of 𝒩 for inputs near x, then z is dead at x and hence does not influence 𝒩 at x. Thus, we expect the more refined inclusion 𝒩zS~z. Finally, if xS~z for some z then x𝒩 unless the contribution from other neurons to 𝒩(y) for y near x exactly cancels the discontinuity in z(x). This happens with probability 0.

The next step in proving Theorem 3 is to identify the portions of 𝒩 of each dimension. To do this, we write for any distinct neurons z1,,zk,

S~z1,,zk:=j=1kS~zj.

The set S~z1,,zk is, intuitively, the collection of inputs at which zj(x)-bzj switches between linear regions for ϕ and at which the output of 𝒩 is affected by the post-activations of these neurons. Proposition 9 shows that we may represent 𝒩 as a disjoint union

𝒩=k=1nin𝒩,k,

where

𝒩,k:=distinct neuronsz1,,zkS~z1,,zk(zz1,,zkS~z)c.

In words, 𝒩,k is the collection of inputs in 𝒪 at which exactly k neurons turn from on to off. The following Proposition shows that 𝒩,k is precisely the “(nin-k)-dimensional piece of 𝒩” (see (5)).

Proposition 10.

Fix k=1,,nin, and k distinct neurons z1,,zk in N. Then, with probability 1, for every xBN,k there exists a neighborhood in which BN,k coincides with a (nin-k)-dimensional hyperplane.

We prove Proposition 10 in §C.2. The idea is that each S~z1,,zk is piecewise linear and, with probability 1, at every point at which exactly the neurons z1,,zk contribute to 𝒩, its co-dimension is the number of linear conditions needed to define it. Observe that with probability 1, the bias vector (bz1,,bzk+1) for any collection z1,,zk+1 of distinct neurons is a regular value for x(z1(x),,zk+1(x)). Hence,

volnin-k(S~z1,,zk+1)=0.

Proposition 10 thus implies that, with probability 1,

volnin-k(𝒩,k)=distinctneuronsz1,,zkvolnin-k(S~z1,,zk).

The final step in the proof of Theorem 3 is therefore to prove the following result.

Proposition 11.

Let z1,,zk be distinct neurons in N. Then, for any bounded, measurable KRnin,

𝔼[volnin-k(S~z1,,zk)]
=Ki1,,ik=1T𝔼[Yz1,,zk(Si1,,Sik)(x)]dx,

where Yz1,,zk(Si1,,Sik) is defined as in (13).

We provide a detailed proof of Proposition 11 in §C.3. The intuition is that the image of the volume element dx under xz(x)-Si is the volume element

Jz1,,zk(x)dx

from (13). The probability of an infinitesimal neighborhood dx of x belonging to a (nin-k)-dimensional piece of 𝒩 is therefore the probability

ρbz1,,bzk(z1(x)-Si1,,zk(x)-Sik)
×Jz1,,zk(x)dx

that the vector of biases (bzj,j=1,,k) belongs to the image of dx under map (zj(x)-Sij,j=1,,k) for some collection of breakpoints Sij. The formal argument uses the co-area formula (see (29) and (30)).

Appendix C Proof of Theorem 3

C.1 Proof of Proposition 9

Recall that the non-linearity ϕ: is continuous and piecewise linear with T breakpoints ξ1<<ξT, so that, with ξ0=-,ξT+1=, we have

t(ξi,ξi+1)ϕ(t)=qit+pi

with qiqi+1. For each xnin, write

Zx+ :={z|z(x)-bz(ξi,ξi+1) and qi0 for some i}
Zx- :={z|z(x)-bz(ξi,ξi+1) and qi=0 for some i}
Zx0 :={z|z(x)-bz=ξi for some i}

Intuitively, Zx+ are the neurons that, at the input x are open (i.e. contribute to the gradient of the output 𝒩(x)) but do not change their contribution in a neighborhood of x, Zx- are the neurons that are closed, and Zx0 are the neurons that, at x, produce a discontinuity in the derivative of 𝒩. Thus, for example, if ϕ=ReLU, then

Zx*:={z|sgn(z(x)-bz)=*},*{+,-,0}.

We begin by proving that 𝒩zS~z by checking the contrapositive

(zS~z)c𝒩c. (19)

Fix x(zS~z)c. Note that Zx± are locally constant in the sense that there exists ε>0 so that for all y with y-x<ε, we have

Zx-Zy-,Zx+Zy+,Zy+Zy0Zx+Zx0. (20)

Moreover, observe that if in the definition (11) of ϕ none of the slopes qi equal 0, then Zy-= for every y. To prove (19), consider any path γ from the input to the output in the computational graph of 𝒩. Such a path consists of d+1 neurons, one in each layer:

γ=(zγ(0),,zγ(d)),(zγ(j))=j.

To each path we may associate a sequence of weights:

wγ(j):=weight connecting zγ(j-1) to zγ(j),j=1,,d.

We will also define

qγ(j)(x):=i=0Tqi𝟏{zγ(x)-bzγ(j)(ξi,ξi+1]}.

For instance, if ϕ=ReLU, then

qγ(j)(x)=𝟏{zγ(j)(x)-bz0},

and in general only one term in the definition of qγ(j)(x) is non-zero for each z. We may write

𝒩(y)=i=1ninyipaths γ:ioutj=1dqγ(j)(y)wγ(j)+constant, (21)

Note that if x(zS~z)c, then for any path γ through a neuron zZx0, we have

j s.t. zγ(j)Zx-.

This is an open condition in light of (20), and hence for all y in a neighborhood of x and for any path γ through a neuron zZx0 we also have that

j s.t. zγ(j)Zy-.

Thus, since the summand in (21) vanishes identically if γZy-, we find that for y in a neighborhood of any x(zS~z)c we may write

𝒩(y)=i=1ninyipaths γ:ioutγZx+j=1dqγ(j)(y)wγ(j)+constant. (22)

But, again by (20), for any fixed x, all y in a neighborhood of x and each zZx+, we have zZy+ as well. Thus, in particular,

z(x)-bz(ξi,ξi+1)z(y)-bz(ξi,ξi+1).

Thus, for y sufficiently close to x, we have for every path in the sum (22) that

qγ(j)(y)=qγ(j)(x).

Therefore, the partial derivatives (𝒩/yi)(y) are independent of y in a neighborhood of x and hence continuous at x. This proves (19). Let us now prove the reverse inclusion:

zS~z𝒩 (23)

Note that, with probability 1, we have

volnin-1(Sz1Sz2)=0

for any pair of distinct neurons z1,z2. Note also that since x𝒩(x) is continuous and piecewise linear, the set 𝒩 is closed. Thus, it is enough to show the slightly weaker inclusion

z(S~z\z^zSz^)𝒩 (24)

since the closure of S~z\z^zSz^ equals S~z. Fix a neuron z and suppose xS~z\z^zSz^. By definition, we have that for every neuron z^z, either

z^Zx+orz^Zx-.

This has two consequences. First, by (20), the map yz(y) is linear in a neighborhood of x. Second, in a neighborhood of x, the set S~z coincides with Sz. Hence, combining these facts, near x the set S~z coincides with the hyperplane

{x|z(x)-bz=ξi},for some i. (25)

We may take two sequences of inputs yn+,yn- on opposite sides of this hyperplane so that

limnyn+=limnyn-=x

and

ϕ(z(yn+)-bz)=qi,ϕ(z(yn+)-bz)=qi-1,n,

where the index i the same as the one that defines the hyperplane (25). Further, since 𝒩 has co-dimension 1 (it is contained in the piecewise linear co-dimension 1 set zSz, for example), we may also assume that yn+,yn-𝒩. Consider any path γ from the input to the output of the computational graph of 𝒩 passing through z (so that z=zγ(j)γ). By construction, for every n, we have

qγ(j)(yn+)qγ(j)(yn-),

and hence, after passing to a subsequence, we may assume that the symmetric difference

Zyn++ΔZyn-+ (26)

of the paths that contribute to the representation (21) for yn+,yn- is fixed and non-empty (the latter since it always contains z). For any y𝒩, we may write, for each i=1,,nin

𝒩yi(y)=pathsγ:ioutγZy+j=1dqγ(j)(y)wγ(j). (27)

Substituting into this expression y=yn±, we find that there exists a non-empty collection Γ of paths from the input to the output of 𝒩 so that

𝒩yi(yn+)-𝒩yi(yn-)=γΓajj=1dcγ(j)wγ(j)

where

aj{-1,1},cγ(j){q0,,qT}.

Note that the expression above is a polynomial in the weights of 𝒩. Note also that, by construction, this polynomial is not identically zero due to the condition (26). There are only finitely many such polynomials since both aj and cγ(j) range over a finite alphabet. For each such non-zero polynomial, the set of weights at which it vanishes has co-dimension 1. Hence, with probability 1, the difference 𝒩yi(yn+)-𝒩yi(yn-) is non-zero. This shows that the partial derivatives 𝒩yi are not continuous at x and hence that x𝒩.

C.2 Proof of Proposition 10

Fix distinct neurons z1,,zk and suppose xS~z1,,zk but not in S~z for any zz1,,zk. After relabeling, we may assume that they are ordered by layer index:

(z1)(zk).

Since x𝒪, we also have that xSz for any zz1,,zk. Thus, there exists a neighborhood U of x so SzU= for every zz1,,zk. Thus, there exists a neighborhood of x on which yz1(y) is linear.

Hence, as explained near (25) above, S~z1 is a hyperplane near x. We now restrict our inputs to this hyperplane and repeat this reasoning to see that, near x, the set S~z1,z2 is a hyperplane inside S~z1 and hence, near x, is the intersection of two hyperplanes in nin. Continuing in this way shows that in a neighborhood of x, the set S~z1,,zk is equal to the intersection of k hyperplanes in nin. Thus, S~z1,,zk\(zz1,,zkS~z)c is precisely the intersection of k hyperplanes in a neighborhood of each of its points.

C.3 Proof of Proposition 11

Let z1,,zk be distinct neurons in 𝒩, and fix a compact set Knin. We seek to compute the mean of volnin-k(S~z1,,zkK), which we may rewrite as

Sz1,,zkK𝟏{zj is good at xj=1,,k}dvolnin-k(x) (28)
=i1,,ik=1TSz1,,zk(ξi1,,ξik)K𝟏{zj is good at xj=1,,k}dvolnin-k(x),

where we’ve set

Sz1,,zk(ξi1,,ξik)={x|zj(x)-bzj=ξij,j=1,,k}.

Note that the map x(z1(x),,zk(x)) is Lipschitz, and recall the co-area formula, which says that if ψL1(n) and g:nm with mn is Lipschitz, then

mg-1(t)ψ(x)dvoln-m(x)𝑑t (29)

equals

nψ(x)Jg(x)dvoln(x), (30)

where Jg is the m×n Jacobian of g and

Jg(x)=det((Jg(x))(Jg(x))T)1/2.

We assumed that the biases bz1,,bzj have a joint conditional density

ρ𝐛𝐳=ρbz1,,bzk

given all other weights and biases. The mean of the term in (28) corresponding to a fixed ξ=(ξi1,,ξik) over the conditional distribution of bz1,,bzj is therefore

k𝑑𝐛ρ𝐛𝐳(𝐛){𝐳-𝐛=ξ}K𝟏{zj is good at xj=1,,k}dvolnin-k(x),

where we’ve abbreviated 𝐛=(b1,,bk) as well as 𝐳(x)=(z1(x),,zk(x)). This can rewritten as

k𝑑𝐛{𝐳=𝐛}Kρ𝐛𝐳(𝐳(x)-ξ)𝟏{zj is good at xj=1,,k}dvoln0-k(x).

Thus, applying the co-area formula (29), (30) shows that the average of (28) over the conditional distribution of bz1,,bzj is precisely

KYz1,,zk(x)𝑑x.

Taking the average over the remaining weighs and biases, we may commute the expectation 𝔼[] with the dx integral since the integrand is non-negative. This completes the proof of Proposition 11.

Appendix D Proof of Corollary 7

We begin by proving the upper bound in (15). By Theorem 3, 𝔼[vol(𝒩,kK)] equals

distinctneuronsz1,,zki1,,ik=1TK𝔼[Yz1,,zk(ξi1,,ξik)(x)](x)𝑑x,

where, as in (13), Yz1,,zk(ξi1,,ξik)(x) is

Jz1,,zk(x)ρbz1,,bzk(z1(x)-ξi1,,zk(z)-ξik)

times the indicator function of the even that zj is good at x for every j. When the weights and biases of 𝒩 are independent, we may write ρbz1,,bzk(b1,,bk) as

j=1kρbzj(bj)(supneuronszsupbρbz(b))k=Cbiask.

Hence,

Yz1,,zk(x)Cbiask(det(Jz1,,zk(x)(Jz1,,zk(x))T))1/2.

Note that

Jz1,,zk(x)(Jz1,,zk(x))T=Gram(z1(x),,zk(x)),

where for any vin

Gram(v1,,vk)i,j=vi,vj

is the associated Gram matrix. The Gram identity says that det(Jz1,,zk(x)(Jz1,,zk(x))T)1/2 equals

z1(x)zk(x),

which is the the k-dimensional volume of the parallelopiped in nin spanned by {zj(x),j=1,,k}. We thus have

det(Jz1,,zk(x)(Jz1,,zk(x))T)1/2j=1kzj(x).

The estimate (14) proves the upper bound (15). For the special case of ϕ=ReLU we use the AM-GM inequality and Jensen’s inequality to write

𝔼[j=1kzj(x)] 𝔼[(1kj=1kzj(x))k]
1kj=1k𝔼[zjk].

Therefore, by Theorem 1 of Hanin & Nica (2018), there exist C1,C2>0 so that

𝔼[j=1kzj(x)](C1eC2j=1d1nj)k.

This completes the proof of the upper bound in (15). To prove the power bound, lower bound in (15) we must argue in a different way. Namely, we will induct on k and use the following facts to prove the base case k=1:

  1. 1.

    At initialization, for each fixed input x, the random variables {𝟏{z(x)>bz}} are independent Bernoulli random variables with parameter 1/2. This fact is proved in Proposition 2 of Hanin & Nica (2018). In particular, the event {z is good at x}, which occurs when there exists a layer j(z)+1,,d in which z(x)bz for every neuron, is independent of {z(x),bz} and satisfies

    (z is good at x)1-j=1d2-nj. (31)
  2. 2.

    At initialization, for each fixed input x, we have

    12𝔼[z(x)2]=x2nin+j=1(z)σbj2, (32)

    where σbj2:=Var[biasesatlayerj]. This is Equation (11) in the proof of Theorem 5 from Hanin & Rolnick (2018).

  3. 3.

    At initialization, for every neuron z and each input x, we have

    𝔼[z(x)2]=2. (33)

    This follows easily from Theorem 1 of Hanin (2018).

  4. 4.

    At initialization, for each 1jnin and every xnin

    𝔼[log(nin(zxj(x))2)]=-52j=1(z)1nj (34)

    plus O(j=1(z)1nj2), where nj is the width of the jth hidden layer and the implied constant depends only on the 4th moment of the measure μ according to which weights are distributed. This estimate follows immediately by combining Corollary 26 and Proposition 28 in Hanin & Nica (2018).

We begin by proving the lower bound in (15) when k=1. We use (31) to see that 𝔼[volnin-1(𝒩K)] is bounded below by

(1-j=1d2-nj)neuronszK𝔼[z(x)ρbz(z(x))]𝑑x.

Next, we bound the integrand. Fix xnin and a parameter η>0 to be chosen later. The integrand 𝔼[z(x)ρbz(z(x))] is bounded below by

𝔼[z(x)ρbz(z(x))𝟏{|z(x)|}η]
 [inf|b|ηρbz(b)]𝔼[z(x)𝟏{|z(x)|η}],

which is bounded below by

[inf|b|ηρbz(b)][𝔼[z(x)]-𝔼[z(x)𝟏{|z(x)|}>η]].

Using Cauchy-Schwarz, the term 𝔼[z(x)𝟏{|z(x)|}>η] is bounded above by

(𝔼[z(x)]2(|z(x)|>η))1/2,

which using (33) and (32) together with Markov’s inequality, is bounded above by

2η1/2(x2nin+j=1(z)σbj2)1/2.

Next, using Jensen’s inequality twice, we write

log𝔼[z(x)]  12𝔼[log(z(x)2)]
 =12𝔼[log(j=1nin(zxj(x))2)]
 12𝔼[log(nin1/2zxj(x))2]
 =-54j=1(z)1nj+O(j=1(z)1nj2),

where in the last inequality we applied (34). Putting this all together, we find that exists c>0 so that

𝔼[z(x)ρbz(z(x))]c[inf|b|ηρbz(b)],

where

η4(x2nin+j=1dσbj2)e54j=1d1nj+O(j=1(z)1nj2).

In particular, we may take

η=(supxKx2nin+j=1dσbj2)eCj=1d1nj

for C sufficiently large. This completes the proof of the lower bound in (15) when k=1. To complete the proof of Corollary 7, suppose we have proved the lower bound in (15) for all ReLU networks 𝒩 and all collections of k-1 distinct neurons. We may assume after relabeling that the neurons z1,,zk are ordered by layer index:

(z1)(zk).

With probability 1, the set Sz1nin is piecewise linear, co-dimension 1 with finitely many pieces, which we denote by Pα. We may therefore rewrite volnin-k(S~z1,,zkK) as

αvolnin-k(S~z2,,zkPαK).

We now define a new neural network 𝒩α, obtained by restricting 𝒩 to Pα. The input dimension for 𝒩α equals nin-1, and the weights and biases of 𝒩α satisfy all the assumptions of Corollary 7. We can now apply our inductive hypothesis to the k-1 neurons z2,,zk in 𝒩α and to the set KPα. This gives

𝔼[αvolnin-k(S~z2,,zkPαK)]
(infzinf|b|ηρbz(b))k-1𝔼[volnin-1(PαK)].

Summing this lower bound over α yields

𝔼[volnin-k(S~z1,,zkK)]
(infzinf|b|ηρbz(b))k-1𝔼[volnin-1(S~z1K)].

Applying the inductive hypothesis once more completes the proof.

Appendix E Proof of Corollary 8

We will need the following observation.

Lemma 12.

Fix a positive integer n1, and let SRn be a compact continuous piecewise linear submanifold with finitely many pieces. Define S0= and let Sk be the union of the interiors of all k-dimensional pieces of S\(S0Sk-1). Denote by Tε(X) the ε-tubular neighborhood of any XRn. We have

voln(Tε(S))k=0nωn-kεn-kvolk(Sk),

where ωd:=volume of ball of radius 1 in Rd.

Proof.

Define d to be the maximal dimension of the linear pieces in S. Let xTε(S). Suppose xTε(Sk) for all k=0,,d-1. Then the intersection of the ball of radius ε around s with S is a ball inside SdUd. Using the convexity of this ball, there exists a point y in Sd so that the vector x-y is parallel to the normal vector to Sd at y. Hence, x belong to the normal ε-ball bundle Bε(N*(Sd)) (i.e. the union of the fiber-wise ε-balls in the normal bundle to Sd). Therefore, we have

voln(Tε(S))voln(Bε(N*(Sd)))+voln(Tε(Sd-1)),

where we abbreviated Sd-1:=k=0d-1Sk¯. Using that

voln(Bε(N*(Sd))) =vold(Sd)voln-d(Bε(n-d))
=vold(Sd)εn-dωn-d

and repeating this argument d-1 times completes the proof. ∎

We are now ready to prove Corollary 2. Let xK=[0,1]nin be uniformly chosen. Then, for any ε>0, using Markov’s inequality and Lemma 12, we have

𝔼[distance(x,𝒩)]
ε(distance(x,𝒩)>ε)
=ε(1-(distance(x,𝒩)ε))
=ε(1-𝔼[volnin(Tε(𝒩))])
ε(1-k=1ninωnin-kεnin-k𝔼[volnin-k(𝒩,k)])
ε(1-k=1nin(CgradCbiasε#{neurons})k)
ε(1-CCgradCbiasε#{neurons})

for some C>0. Taking ε to be a small constant times 1/(Cgrad#{neurons}) completes the proof.