Complexity of Linear Regions in Deep Networks
It is well-known that the expressivity of a neural network depends on its architecture, with deeper networks expressing more complex functions. In the case of networks that compute piecewise linear functions, such as those with ReLU activation, the number of distinct linear regions is a natural measure of expressivity. It is possible to construct networks with merely a single region, or for which the number of linear regions grows exponentially with depth; it is not clear where within this range most networks fall in practice, either before or after training. In this paper, we provide a mathematical framework to count the number of linear regions of a piecewise linear network and measure the volume of the boundaries between these regions. In particular, we prove that for networks at initialization, the average number of regions along any one-dimensional subspace grows linearly in the total number of neurons, far below the exponential upper bound. We also find that the average distance to the nearest region boundary at initialization scales like the inverse of the number of neurons. Our theory suggests that, even after training, the number of linear regions is far below exponential, an intuition that matches our empirical observations. We conclude that the practical expressivity of neural networks is likely far below that of the theoretical maximum, and that this gap can be quantified.
1 Introduction
A growing field of theory has sought to explain the broad success of deep neural networks via a mathematical characterization of the ability of these networks to approximate different functions of input data. Many such works consider the expressivity of neural networks, showing that certain functions are more efficiently expressible by deep architectures than by shallow ones (e.g. Bianchini & Scarselli (2014); Montufar et al. (2014); Telgarsky (2015); Lin et al. (2017); Rolnick & Tegmark (2018)). It has, however, also been noted that many expressible functions are not efficiently learnable, at least by gradient descent (Shalev-Shwartz et al., 2018). More generally, the typical behavior of a network used in practice, the practical expressivity, may be very different from what is theoretically attainable. To adequately explain the power of deep learning, it is necessary to consider networks with parameters as they will naturally occur before, during, and after training.
Networks with a piecewise linear activation (e.g. ReLU, hard tanh) compute piecewise linear functions for which input space is divided into pieces, with the network computing a single linear function on each piece (see Figures 1-4). Figure 2 shows how the complexity of these pieces, which we refer to as linear regions, changes in a deep ReLU net with two-dimensional inputs. Each neuron in the first layer splits the input space into two pieces along a hyperplane, fitting a different linear function to each of the pieces. Subsequent layers split the regions of the preceding layers. The local density of linear regions serves as a convenient proxy for the local complexity or smoothness of the network, with the ability to interpolate a complex data distribution seeming to require fitting many relatively small regions. The topic of counting linear regions is taken up by a number of authors (Telgarsky, 2015; Montufar et al., 2014; Serra et al., 2018; Raghu et al., 2017).
A worst-case estimate is that every neuron in each new layer splits each of the regions present at the previous layer, giving a number of regions exponential in the depth. Indeed, this is possible, as examined extensively e.g. in Montufar et al. (2014). An example of Telgarsky (2015) shows that a sawtooth function with $2^n$ teeth can be expressed exactly using only $3n+4$ neurons, as shown in Figure 3. However, even slightly perturbing this network (by adding noise to the weights and biases) ruins this beautiful structure and severely reduces the number of linear pieces, raising the question of whether typical neural networks actually achieve the theoretical bounds for numbers of linear regions.
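The exponential construction can be illustrated with a simplified variant. The sketch below is not Telgarsky's exact network: it stacks a two-neuron tent map `tent(x) = 2*relu(x) - 4*relu(x - 0.5)`, so that `n` layers give `2**n` linear pieces on `[0, 1]`. The helper names (`tent_layers`, `slope_at`, `count_pieces`, `perturb`) and the noise scale are assumptions made for illustration. Pieces are counted by comparing the exact local slope at the midpoints of a fine grid, so the count is only as fine as the grid.

```python
import numpy as np

def tent_layers(n):
    """Hypothetical helper: n stacked tent maps, two ReLU neurons per layer."""
    layers = []
    for k in range(n):
        if k == 0:
            W = np.array([[1.0], [1.0]])              # both neurons read the input x
        else:
            W = np.array([[2.0, -4.0], [2.0, -4.0]])  # both read tent(prev) = 2a - 4b
        b = np.array([0.0, -0.5])                     # a = relu(h), b = relu(h - 0.5)
        layers.append((W, b))
    w_out = np.array([2.0, -4.0])                     # final read-out 2a - 4b
    return layers, w_out

def slope_at(x, layers, w_out):
    """Derivative of the network output with respect to the scalar input x."""
    h, dh = np.array([x]), np.array([1.0])
    for W, b in layers:
        pre, dpre = W @ h + b, W @ dh
        on = pre > 0
        h, dh = np.where(on, pre, 0.0), np.where(on, dpre, 0.0)
    return float(w_out @ dh)

def count_pieces(layers, w_out, m=12):
    """Count linear pieces on [0, 1] resolved by a grid with 2**m cells."""
    xs = (np.arange(2**m) + 0.5) / 2**m               # cell midpoints
    slopes = np.array([slope_at(x, layers, w_out) for x in xs])
    return 1 + int(np.sum(~np.isclose(slopes[1:], slopes[:-1])))

def perturb(layers, w_out, scale, rng):
    """Add i.i.d. Gaussian noise of the given scale to all weights and biases."""
    noisy = [(W + scale * rng.standard_normal(W.shape),
              b + scale * rng.standard_normal(b.shape)) for W, b in layers]
    return noisy, w_out + scale * rng.standard_normal(w_out.shape)

if __name__ == "__main__":
    layers, w_out = tent_layers(6)
    print("clean pieces:", count_pieces(layers, w_out))
    noisy, w_noisy = perturb(layers, w_out, 0.1, np.random.default_rng(0))
    print("noisy pieces:", count_pieces(noisy, w_noisy))
```

How the count changes under noise depends on the noise scale and the random seed, so the sketch is meant for experimentation rather than as a reproduction of Figure 3.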
Figure 1 also invites measures of complexity for piecewise linear networks beyond region counting. The boundary between two linear regions can be straight or can be bent in complex ways, for example, suggesting the volume of the boundary between linear regions as a complexity measure for the resulting partition of input space. In the 2D example of Figure 1, this corresponds to computing perimeters of the linear pieces. As we detail below, this measure has another natural advantage: the volume of the boundary controls the typical distance from a random input to the boundary of its linear region (see §2.2). This measures the stability of the function computed by the network, and it is intuitively related to robustness under adversarial perturbation.
Our Contributions. In this paper, we provide mathematical tools for analyzing the complexity of linear regions of a network with piecewise linear activations (such as ReLU) before, during, and after training. Our main contributions are as follows:
For networks at initialization, the total surface area of the boundary between linear regions scales as the number of neurons times the number of breakpoints of the activation function. This is our main result, from which several corollaries follow (see Theorem 3, Corollary 4, and the discussion in §2).
In particular, for any line segment through input space, the average number of regions intersecting it is linear in the number of neurons, far below the exponential number of regions that is theoretically attainable.
We find empirically that both the number of regions and the distance to the nearest region boundary stay roughly constant during training and in particular are far from their theoretical maxima. That this should be the case is strongly suggested by Theorem 3, though not a direct consequence of it.
Overall, our results stress that practical expressivity lags significantly behind theoretical expressivity. Moreover, both our theoretical and empirical findings suggest that for certain measures of complexity, trained deep networks are remarkably similar to the same networks at initialization.
In the next section, we informally state our theoretical and empirical results and explore the underlying intuitions. Detailed descriptions of our experiments are provided in §3. The precise theorem statements for ReLU networks can be found in §5. The exact formulations for general piecewise linear networks are in Appendix A, with proofs in the rest of the Supplementary Material. In particular, Appendix B contains intuition for how our proofs are shaped, while details are completed in §C-D.
2 Informal Overview of Results
This section gives an informal introduction to our results. We begin in §2.1 by describing the case of networks with input dimension $1$. In §2.2, we consider networks with higher input dimension. For simplicity, we focus throughout this section on fully connected ReLU networks. We emphasize, however, that our results apply to any piecewise linear activation. Moreover, the upper bounds we present in Theorems 1, 2, and 3 (and hence in Corollaries 4 and 5) can also be generalized to hold for feed-forward networks with arbitrary connectivity, though we do not go into details in this work, for the sake of clarity of exposition.
2.1 Number of Regions in 1D
Consider the simple case of a net $\mathcal N$ with input and output dimensions equal to $1$. Such a network computes a piecewise linear function (see Figure 4), and we are interested in understanding the number of distinct linear regions, both at initialization and during training. There is a simple universal upper bound:
$$\max \#\{\text{regions of } \mathcal N\} \le 2^{\#\{\text{neurons in } \mathcal N\}}, \tag{1}$$
where the maximum is over all settings of weights and biases. This bound depends on the architecture of $\mathcal N$ only via the number of neurons. For more refined upper bounds which take into account the widths of the layers, see Theorem 1 in Raghu et al. (2017) and Theorem 1 in Serra et al. (2018).
The constructions in Montufar et al. (2014); Telgarsky (2015); Raghu et al. (2017); Serra et al. (2018) indicate that the bound in (1) is very far from sharp for shallow and wide networks but that exponential growth in the number of regions can be achieved in deep, skinny networks for very special choices of weights and biases. This is a manifestation of the expressive power of depth, the idea that repeated compositions allow deep networks to capture complex hierarchical relations more efficiently per parameter than their shallow cousins. However, there is no non-trivial lower bound for the number of linear regions:
$$\min \#\{\text{regions of } \mathcal N\} = 1.$$
The minimum is attained by setting all weights and biases to $0$. This raises the question of the behavior of the average number of regions when the weights and biases are chosen at random (e.g. at initialization). Intuitively, configurations of weights and biases that come close to saturating the exponential upper bound (1) are numerically unstable in the sense that a small random perturbation of the weights and biases drastically reduces the number of linear regions (see Figure 3 for an illustration). In this direction, we prove a somewhat surprising answer to the question of how many regions $\mathcal N$ has at initialization. We state the result for ReLU but note that it holds for any piecewise linear, continuous activation function (see Theorems 3 and 6).
Theorem 1 (informal).
Let $\mathcal N$ be a network with piecewise linear activation with input and output dimensions of $\mathcal N$ both equal to $1$. Suppose the weights and biases are randomly initialized so that for each neuron $z$, its pre-activation $z(x)$ has bounded mean gradient
$$\mathbb E\big[\|\nabla z(x)\|\big] \le C, \qquad \text{some } C > 0. \tag{2}$$
This holds, for example, for networks initialized with independent, zero-centered weights with variance $2/\text{fan-in}$. Then, for each subset $I \subset \mathbb R$ of inputs, the average number of linear regions inside $I$ is proportional to the number of neurons times the length of $I$:
$$\mathbb E\big[\#\{\text{regions in } I\}\big] \approx |I| \cdot T \cdot \#\{\text{neurons}\},$$
where $T$ is the number of breakpoints in the non-linearity of $\mathcal N$ (for ReLU nets, $T = 1$). The same result holds when computing the number of linear regions along any fixed $1$-dimensional curve in a high-dimensional input space.
This theorem implies that the average number of regions along a one-dimensional curve in input space is proportional to the number of neurons, but independent of the arrangement of those neurons. In particular, a shallow network and a deep network will have the same complexity, by this measure, as long as they have the same total number of neurons. Of course, as $|I|$ grows, the bounds in Theorem 1 become less sharp. We plan to extend our results to obtain bounds on the total number of regions on all of $\mathbb R$ in the future. In particular, we believe that at initialization the mean total number of linear regions is proportional to the number of neurons (this is borne out in Figure 5, which computes the total number of regions on an infinite line).
Theorem 1 defies the common intuition that, on average, each layer in $\mathcal N$ multiplies the number of regions formed up to the previous layer by a constant larger than one. This would imply that the average number of regions is exponential in the depth. To provide intuition for why this is not true for random weights and biases, consider the effect of each neuron separately. Suppose the pre-activation $z(x)$ of a neuron $z$ satisfies $|z'(x)| = O(1)$, a hallmark of any reasonable initialization. Then, over a compact set of inputs, the piecewise linear function $x \mapsto z(x)$ cannot be highly oscillatory over a large portion of the range of $z$. Thus, if the bias $b_z$ is not too concentrated on any interval, we expect the equation $z(x) = b_z$ to have $O(1)$ solutions. On average, then, we expect that each neuron adds a constant number of new linear regions. Thus, the average total number of regions should scale roughly as the number of neurons.
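The single-neuron heuristic can be made slightly more explicit. The computation below is a sketch in the spirit of the co-area argument, not a statement taken from the paper: the symbol $\rho_{b_z}$ (the density of the bias) and the assumption that $b_z$ is independent of the function $z$ are introduced here for illustration.

```latex
% Fix the piecewise linear function z on an interval I and let the bias b_z have
% density \rho_{b_z}, independent of z. The one-dimensional co-area formula gives
\mathbb{E}_{b_z}\big[\#\{x \in I : z(x) = b_z\}\big]
  = \int_{\mathbb{R}} \#\{x \in I : z(x) = b\}\,\rho_{b_z}(b)\,db
  = \int_I |z'(x)|\,\rho_{b_z}(z(x))\,dx .
% Averaging over the weights and bounding the density by its supremum:
\mathbb{E}\big[\#\{x \in I : z(x) = b_z\}\big]
  \le |I| \cdot \sup_{b} \rho_{b_z}(b) \cdot \sup_{x \in I} \mathbb{E}\big[|z'(x)|\big].
```

Under these assumptions, a bounded bias density together with a bounded mean gradient yields $O(|I|)$ expected breakpoints per neuron, which is the scaling the paragraph above describes.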
Theorem 1 follows from a general result, Theorem 3, that holds for essentially any non-degenerate distribution of weights and biases and with any input dimension. If $\nabla z(x)$ and the bias distribution are well-behaved, then throughout training, Theorem 3 suggests the number of linear regions along a $1$-dimensional curve in input space scales like the number of neurons in $\mathcal N$. Figures 5-6 show experiments that give empirical verification of this heuristic.
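A quick numerical check of this scaling can be sketched as follows. This is a minimal illustration, not the authors' experimental code: it uses He-normal weights, an assumed bias standard deviation `bias_std`, scalar inputs on `[-1, 1]`, and it counts activation-pattern changes between neighbouring grid points. That count is a grid-limited proxy, since it also registers neurons whose switching does not affect the output. All function names are hypothetical.

```python
import numpy as np

def init_net(widths, rng, bias_std=0.1):
    """He-normal weights (variance 2/fan_in); bias_std is an illustrative assumption."""
    params, fan_in = [], 1
    for n in widths:
        W = rng.standard_normal((n, fan_in)) * np.sqrt(2.0 / fan_in)
        b = rng.standard_normal(n) * bias_std
        params.append((W, b))
        fan_in = n
    return params

def count_regions_on_grid(params, xs):
    """1 + number of neighbouring grid points whose ReLU on/off patterns differ."""
    h = xs[None, :]                                   # shape (1, num_points)
    patterns = []
    for W, b in params:
        pre = W @ h + b[:, None]
        patterns.append(pre > 0)
        h = np.maximum(pre, 0.0)
    P = np.concatenate(patterns, axis=0)              # shape (neurons, num_points)
    changed = np.any(P[:, 1:] != P[:, :-1], axis=0)
    return 1 + int(changed.sum())

def mean_regions_per_neuron(widths, trials=20, seed=0):
    """Average grid-based region count divided by the number of hidden neurons."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-1.0, 1.0, 20001)
    counts = [count_regions_on_grid(init_net(widths, rng), xs) for _ in range(trials)]
    return float(np.mean(counts)) / sum(widths)

if __name__ == "__main__":
    # Same total number of hidden neurons (64), arranged at different depths.
    for widths in ([64], [32, 32], [16, 16, 16, 16]):
        print(widths, mean_regions_per_neuron(widths))
```

Comparing architectures with the same total number of neurons is the point of the design: if the count depends mainly on the neuron total, the printed ratios should be of similar size across depths.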
2.2 Higher-Dimensional Regions
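For networks with input dimension exceeding $1$, there are several ways to generalize counting linear regions. A unit-matching heuristic applied to Theorem 1 suggests
$$\mathbb E\big[\#\{\text{regions in } K\}\big] \approx \mathrm{vol}(K) \cdot \big(T \cdot \#\{\text{neurons}\}\big)^{n_{\mathrm{in}}}, \qquad K \subset \mathbb R^{n_{\mathrm{in}}}.$$
Proving this statement is work in progress by the authors. Instead, we consider here a natural and, in our view, equally important generalization. Namely, for a bounded $K \subset \mathbb R^{n_{\mathrm{in}}}$, we consider the $(n_{\mathrm{in}}-1)$-dimensional volume density
$$\frac{\mathrm{vol}_{n_{\mathrm{in}}-1}(\mathcal B_{\mathcal N} \cap K)}{\mathrm{vol}_{n_{\mathrm{in}}}(K)}, \tag{3}$$
where $\mathcal B_{\mathcal N} := \{x : \nabla \mathcal N(x) \text{ is not continuous at } x\}$ is the boundary of the linear regions for $\mathcal N$. When $n_{\mathrm{in}} = 1$,
$$\mathrm{vol}_0(\mathcal B_{\mathcal N} \cap K) + 1 = \#\{\text{regions in } K\},$$
and hence the volume density (3) truly generalizes the number of regions to higher input dimension. One reason for studying the volume density (3) is that it gives bounds from below for $\mathrm{distance}(x, \mathcal B_{\mathcal N})$, which in turn provides insight into the nature of the computation performed by $\mathcal N$. Indeed, the exact formula
$$\mathrm{distance}(x, \mathcal B_{\mathcal N}) = \min_{\text{neurons } z} \frac{|z(x) - b_z|}{\|\nabla z(x)\|}$$
shows that $\mathrm{distance}(x, \mathcal B_{\mathcal N})$ measures the sensitivity over neurons at a given input $x$. In this formula, $z(x)$ denotes the pre-activation for a neuron $z$ and $b_z$ is its bias, so that $\mathrm{ReLU}(z(x) - b_z)$ is the post-activation. Moreover, the distance from a typical point to $\mathcal B_{\mathcal N}$ gives a heuristic lower bound for the typical distance to an adversarial example: two inputs closer than the typical distance to a linear region boundary likely fall into the same linear region, and hence are unlikely to be classified differently. Our next result generalizes Theorem 1.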
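The distance formula described above can be evaluated with a single forward pass. The sketch below assumes the formula has the form "minimum over neurons of |pre-activation| divided by the norm of its input gradient", and it uses the common convention `pre = W @ h + b` (so the bias enters with the opposite sign to the $z(x) - b_z$ convention). The function name `distance_to_boundary` is hypothetical. Neurons whose gradient vanishes at `x` are skipped, since they cannot switch in a neighborhood of `x`.

```python
import numpy as np

def distance_to_boundary(params, x):
    """Distance from x to the nearest point where some ReLU changes its on/off state.

    params: list of (W, b) for the hidden ReLU layers, with pre = W @ h + b.
    """
    h = np.asarray(x, dtype=float)
    J = np.eye(h.size)                      # Jacobian of current activations w.r.t. x
    best = np.inf
    for W, b in params:
        pre = W @ h + b                     # pre-activations (bias included)
        Jpre = W @ J                        # their gradients w.r.t. the input
        norms = np.linalg.norm(Jpre, axis=1)
        ok = norms > 0                      # neurons that can actually switch locally
        if ok.any():
            best = min(best, float(np.min(np.abs(pre[ok]) / norms[ok])))
        on = pre > 0
        h = np.where(on, pre, 0.0)          # ReLU
        J = Jpre * on[:, None]              # chain rule through the ReLU
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    net = [(rng.standard_normal((16, 5)) * np.sqrt(2 / 5), 0.1 * rng.standard_normal(16)),
           (rng.standard_normal((16, 16)) * np.sqrt(2 / 16), 0.1 * rng.standard_normal(16))]
    print(distance_to_boundary(net, rng.standard_normal(5)))
```

Because every pre-activation is affine in the input as long as no neuron switches, no on/off pattern can change inside the ball of this radius. The value is therefore a lower bound on the distance to $\mathcal B_{\mathcal N}$ whenever some of the switching neurons do not affect the output.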
Theorem 2 (informal).
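Let $\mathcal N$ be a network with a piecewise linear activation, input dimension $n_{\mathrm{in}}$ and output dimension $1$. Suppose its weights and biases are randomly initialized as in (2). Then, for $K \subset \mathbb R^{n_{\mathrm{in}}}$ bounded, the average volume of the linear region boundaries in $K$ satisfies:
$$\mathbb E\left[\frac{\mathrm{vol}_{n_{\mathrm{in}}-1}(\mathcal B_{\mathcal N} \cap K)}{\mathrm{vol}_{n_{\mathrm{in}}}(K)}\right] \approx T \cdot \#\{\text{neurons}\},$$
where $T$ is the number of breakpoints in the non-linearity of $\mathcal N$ (for ReLU nets, $T = 1$). Moreover, if $x \in [0,1]^{n_{\mathrm{in}}}$ is uniformly distributed, then the average, over both $x$ and the weights/biases of $\mathcal N$, distance from $x$ to $\mathcal B_{\mathcal N}$ satisfies
$$\mathbb E\big[\mathrm{distance}(x, \mathcal B_{\mathcal N})\big] \ge C \big(T \cdot \#\{\text{neurons}\}\big)^{-1}, \qquad C > 0.$$
Experimentally, $\mathrm{distance}(x, \mathcal B_{\mathcal N})$ remains comparable to $\#\{\text{neurons}\}^{-1}$ throughout training (see Figure 6).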
3 Experiments
We empirically verified our theorems and further examined how linear regions of a network change during training. All experiments below were performed with fully-connected networks, initialized with He normal weights (i.i.d. with variance $2/\text{fan-in}$) and biases drawn i.i.d. normal with variance (to prevent collapse of regions at initialization, which occurs when all biases are identically zero). Training was performed on the vectorized MNIST (input dimension 784) using the Adam optimizer at learning rate . All networks attain test accuracy in the range .
3.1 Number of Regions Along a Line
We calculated the number of regions along lines through the origin and a randomly selected training example in input space. For each setting of weights and biases within the network during training, the number of regions along each line is calculated exactly by building up the network one layer at a time and calculating how each region is split by the next layer of neurons. Figure 5 represents the average over 5 independent runs, from each of which we sample 100 lines; variance across the different runs is not significant.
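The layer-by-layer procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the line is restricted to a segment `x(t) = x0 + t*v` with `t` in `[t_lo, t_hi]`, layers use the convention `pre = W @ h + b`, and the names `regions_along_segment` and `count_linear_pieces` are hypothetical. On each piece the hidden activations are affine in `t`, so each new layer can only split a piece at the zero crossings of its pre-activations.

```python
import numpy as np

def regions_along_segment(hidden, x0, v, t_lo, t_hi):
    """Exact activation pieces of a ReLU net along x(t) = x0 + t*v.

    hidden: list of (W, b) for the ReLU layers.
    Returns a list of (lo, hi, A, c): activations equal A*t + c on [lo, hi].
    """
    pieces = [(t_lo, t_hi, np.asarray(v, float), np.asarray(x0, float))]
    for W, b in hidden:
        new_pieces = []
        for lo, hi, A, c in pieces:
            alpha, beta = W @ A, W @ c + b            # pre-activation = alpha*t + beta
            nz = alpha != 0
            roots = -beta[nz] / alpha[nz]             # where a neuron switches
            cuts = np.unique(roots[(roots > lo) & (roots < hi)])
            edges = np.concatenate(([lo], cuts, [hi]))
            for a, z in zip(edges[:-1], edges[1:]):
                mid = 0.5 * (a + z)
                on = (alpha * mid + beta) > 0         # pattern is constant on (a, z)
                new_pieces.append((a, z, alpha * on, beta * on))
        pieces = new_pieces
    return pieces

def count_linear_pieces(pieces, w_out):
    """Merge neighbouring pieces on which the scalar output has the same slope."""
    slopes = np.array([float(w_out @ A) for _, _, A, _ in pieces])
    return 1 + int(np.sum(~np.isclose(slopes[1:], slopes[:-1])))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    widths, fan_in, hidden = [16, 16, 16], 8, []
    for n in widths:
        hidden.append((rng.standard_normal((n, fan_in)) * np.sqrt(2 / fan_in),
                       0.1 * rng.standard_normal(n)))
        fan_in = n
    pieces = regions_along_segment(hidden, np.zeros(8), rng.standard_normal(8), -1.0, 1.0)
    print(len(pieces), "activation pieces for", sum(widths), "neurons")
```

Since the output is continuous, two neighbouring pieces with equal output slope compute the same affine function, which is why `count_linear_pieces` merges them. The cost of the procedure is proportional to the number of pieces, so it stays cheap precisely when the region count is far below the exponential bound.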
Figure 5 plots the average number of regions along a line, divided by the number of neurons in the network, as a function of epoch during training. We make several observations:
As predicted by Theorem 3, all networks start out with the number of regions along a line equal to a constant times the number of neurons in the network (the constant in fact appears very close to 1 in this case).
The number of regions actually decreases during the initial part of training, then increases again. We explore this behavior further in other experiments below.
3.2 Distance to the Nearest Region Boundary
We calculated the average distance to the nearest boundary for randomly selected input points, for various networks throughout training. Points were selected randomly from a normal distribution with mean and variance matching the componentwise mean and variance of MNIST training data. Results were averaged over independent runs, but variance across runs is not significant. Rerunning these experiments with sample points selected randomly from (i) the training data or (ii) the test data yielded similar results to random sample points.
In keeping with our results in the preceding experiment, the distance to the nearest boundary first increases then decreases during training. As predicted by Theorem 2, we find that for all networks, the distance to the nearest boundary is well-predicted by $1/\#\{\text{neurons}\}$. Throughout training, we find that it approximately varies between the curves and (Figure 6(a)). At initialization, as we predict, all networks have the same value for the product of number of neurons and distance to the nearest region boundary (Figure 6(b)); these products then diverge (slightly) for different architectures, first increasing rapidly and then decreasing more slowly.
We find Figure 6(c) fascinating, though we do not completely understand it. It plots the product of number of neurons and distance to the nearest region boundary against the test accuracy. It suggests two phases of training: first regions expand, then they contract. This lines up with observations made in Arpit et al. (2017) that neural networks “learn patterns first” on which generalization is simple and then refine the fit to encompass memorization of individual samples. A generalization phase would suggest that regions are growing, while memorization would suggest smaller regions are fit to individual data points. This is, however, speculation and more experimental (and theoretical) exploration will be required to confirm or disprove this intuition.
We found it instructive to consider the full distribution of distances from sample points to their nearest boundaries, rather than just the average. For a single network (depth 4, width 16), Figure 7 indicates that this distribution does not significantly change during training, although there appears to be a slight skew towards larger regions, in agreement with the findings in Novak et al. (2018). The histogram shows log-distances. Hence, distance to the nearest region boundary varies over many orders of magnitude. This is consistent with Figures 1 and 4, which lend credence to the intuition that small distances to the nearest region boundary are explained by the presence of many small regions. According to Theorem 3, this should correlate with a combination of regions in input space at which some neurons have a large gradient and neurons with highly peaked bias distributions. We hope to return to this in future work.
3.3 Regions Within a 2D Plane
We visualized the regions of a network through training. Specifically, following experiments in Novak et al. (2018), we plotted regions within a plane in the $784$-dimensional input space (Figure 8) through three data points with different labels (, , and , in our case) inside a square centered at the circumcenter of the three examples. The network shown has depth and width . We observe that, as expected from our other plots, the regions expand initially during training and then contract again. We expect the number of regions within a $2$-dimensional subspace to be on the order of the square of the number of neurons – that is, , which we indeed find.
Our approach for calculating regions is simple. We start with a single region (in this case, the square), and subdivide it by adding neurons to the network one by one. For each new neuron, we calculate the linear function it defines on each region, and determine whether that region is split into two. This approach terminates within a reasonable amount of time precisely because our theorem holds: there are relatively few regions. Note that we exactly determine all regions within the given square by calculating all region boundaries; thus our counts are exact and do not miss any small regions, as might occur if we merely estimated regions by sampling points from input space.
4 Related Work
There are a number of works that touch on the themes of this article: (i) the expressivity of depth; (ii) counting the number of regions in networks with piecewise linear activations; (iii) the behavior of linear regions through training; and (iv) the difference between expressivity and learnability. Related to (i), we refer the reader to Eldan & Shamir (2016); Telgarsky (2016) for examples of functions that can be efficiently represented by deep but not shallow ReLU nets. Next, still related to (i), for uniform approximation over classes of functions, again using deep ReLU nets, see Yarotsky (2017); Rolnick & Tegmark (2018); Yarotsky (2018); Petersen & Voigtlaender (2018). For interesting results on (ii) about counting the maximal possible number of linear regions in networks with piecewise linear activations see Bianchini & Scarselli (2014); Montufar et al. (2014); Poole et al. (2016); Arora et al. (2018); Raghu et al. (2017). Next, in the vein of (iii), for both a theoretical and empirical perspective on the number of regions computed by deep networks and specifically how the regions change during training, see Poole et al. (2016); Novak et al. (2018). In the direction of (iv), we refer the reader to Shalev-Shwartz et al. (2018); Hanin & Rolnick (2018); Hanin (2018). Finally, for general insights into learnability and expressivity in deep vs. shallow networks see Mhaskar & Poggio (2016); Mhaskar et al. (2016); Zhang et al. (2017); Lin et al. (2017); Poggio et al. (2017); Neyshabur et al. (2017).
5 Formal Statement of Results
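To state our results precisely, we fix some notation. Let $d \ge 1$ and consider a depth $d$ fully connected ReLU net $\mathcal N$ with input dimension $n_{\mathrm{in}}$, output dimension $1$, and hidden layer widths $n_j$. As explained in the introduction, a generic configuration of its weights and biases partitions the input space $\mathbb R^{n_{\mathrm{in}}}$ into a union of polytopes $P_j$ with disjoint interiors. Restricted to each $P_j$, $\mathcal N$ computes a linear function.
More precisely, we set $\mathcal B_{\mathcal N,0} := \emptyset$ and recursively define $\mathcal B_{\mathcal N,k}$ to be the set of points $x \in \mathcal B_{\mathcal N} \setminus \{\mathcal B_{\mathcal N,0} \cup \cdots \cup \mathcal B_{\mathcal N,k-1}\}$ so that in a neighborhood of $x$ the set $\mathcal B_{\mathcal N} \setminus \{\mathcal B_{\mathcal N,0} \cup \cdots \cup \mathcal B_{\mathcal N,k-1}\}$ coincides with a co-dimension $k$ hyperplane.
For example, when $n_{\mathrm{in}} = 2$, the linear regions $P_j$ are polygons, the set $\mathcal B_{\mathcal N,1}$ is the union of the open line segments making up the boundaries of the $P_j$, and $\mathcal B_{\mathcal N,2}$ is the collection of vertices of the $P_j$. Theorem 3 provides a convenient formula for the average of the $(n_{\mathrm{in}}-k)$-dimensional volume of $\mathcal B_{\mathcal N,k}$ inside any bounded, measurable set $K \subset \mathbb R^{n_{\mathrm{in}}}$. To state the result, for every neuron $z$ in $\mathcal N$ we will write
$$z(x) := \text{pre-activation of } z \text{ at input } x, \qquad b_z := \text{bias of } z. \tag{6}$$
Thus, for a given input $x \in \mathbb R^{n_{\mathrm{in}}}$, the post-activation of $z$ is $\max\{0, z(x) - b_z\}$.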
Theorem 3 holds under the following assumption on the distribution of weights and biases:
(A1) The conditional distribution of any collection of biases $b_{z_1}, \dots, b_{z_k}$, given all the other weights and biases, has a density $\rho_{b_{z_1},\dots,b_{z_k}}$ with respect to Lebesgue measure on $\mathbb R^k$.
(A2) The joint distribution of all the weights has a density with respect to Lebesgue measure on $\mathbb R^{\#\text{weights}}$.
These assumptions hold in particular when the weights and biases of $\mathcal N$ are independent with marginal distributions that have a density relative to Lebesgue measure on $\mathbb R$ (i.e. at initialization). They hold much more generally, however, and can intuitively be viewed as a non-degeneracy assumption on the behavior of the weights and biases of $\mathcal N$. Specifically, they are used in Proposition 10 to ensure that the set $\mathcal B_{\mathcal N,k}$ consists of inputs where exactly $k$ neurons turn off/on. Assumption (A1) also allows us, in Proposition 11, to apply the co-area formula (29) to compute the expected volume of the set of inputs where a given collection of neurons turn on/off. Our main result is the following.
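Theorem 3.
Suppose $\mathcal N$ is a feed-forward ReLU net with input dimension $n_{\mathrm{in}}$, output dimension $1$, and random weights/biases. Assume that the distribution of weights/biases satisfies Assumptions (A1) and (A2) above. Then, with the notation (6), for any bounded measurable set $K \subset \mathbb R^{n_{\mathrm{in}}}$ and any $k = 1, \dots, n_{\mathrm{in}}$, the average $(n_{\mathrm{in}}-k)$-dimensional volume
$$\mathbb E\big[\mathrm{vol}_{n_{\mathrm{in}}-k}(\mathcal B_{\mathcal N,k} \cap K)\big]$$
of $\mathcal B_{\mathcal N,k}$ inside $K$ is
$$\sum_{\text{distinct neurons } z_1, \dots, z_k \text{ in } \mathcal N} \int_K \mathbb E\big[Y_{z_1,\dots,z_k}(x)\big]\,dx, \tag{7}$$
where $Y_{z_1,\dots,z_k}(x)$ is
$$\|J_{z_1,\dots,z_k}(x)\| \, \rho_{b_{z_1},\dots,b_{z_k}}\big(z_1(x), \dots, z_k(x)\big) \tag{8}$$
times the indicator function of the event that $z_j$ is good at $x$ for each $j = 1, \dots, k$. Here, $J_{z_1,\dots,z_k}(x)$ is the $k \times n_{\mathrm{in}}$ Jacobian of the map $x \mapsto (z_1(x), \dots, z_k(x))$, with $\|J\| := \det(J J^T)^{1/2}$, the function $\rho_{b_{z_1},\dots,b_{z_k}}$ is the density of the joint distribution of the biases $b_{z_1}, \dots, b_{z_k}$, and we say a neuron $z$ is good at $x$ if there exists a path of neurons from $z$ to the output in the computational graph of $\mathcal N$ so that each neuron along this path is open at $x$.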
Evaluating the expression in (8) requires information on the distribution of gradients $\nabla z(x)$, the pre-activations $z(x)$, and the biases $b_z$. Exact information about these quantities is available at initialization (Hanin, 2018; Hanin & Rolnick, 2018; Hanin & Nica, 2018), yielding the following Corollary.
Corollary 4.
With the notation and assumptions of Theorem 3, suppose the weights are independent and drawn from a fixed probability measure $\mu$ on $\mathbb R$ that is symmetric around $0$ and then rescaled to have variance $2/\text{fan-in}$. Fix $k \in \{1, \dots, n_{\mathrm{in}}\}$. Then there exists for which
where depends only on but not on the architecture of and is the width of the hidden layer. Moreover, we also have similar lower bounds
with depending only on the distribution of the weights in .
6 Conclusions and Further Work
The question of why depth is powerful has been a persistent problem for deep learning theory, and one that recently has been answered by works giving enhanced expressivity as the ultimate explanation. However, our results suggest that such explanations may be misleading. While we do not speak to all notions of expressivity in this paper, we have both theoretically and empirically evaluated one common measure: the linear regions in the partition of input space defined by a network with piecewise linear activations. We found that the average size of the boundary of these linear regions depends only on the number of neurons and not on the network depth – both at initialization and during training. This strongly suggests that deeper networks do not learn more complex functions than shallow networks. We plan to test this interpretation further in future work – for example, with experiments on more complex tasks, as well as by investigating higher order statistics, such as the variance.
We do not propose a replacement theory for the success of deep learning; however, prior work has already hinted at how such a theory might proceed. Notably, Ba & Caruana (2014) show that, once deep networks are trained to perform a task successfully, their behavior can often be replicated by shallow networks, suggesting that the advantages of depth may be linked to easier learning.
References
- Arora et al. (2018) Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. In ICLR, 2018.
- Arpit et al. (2017) Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In ICML, 2017.
- Ba & Caruana (2014) Ba, J. and Caruana, R. Do deep nets really need to be deep? In NeurIPS, pp. 2654–2662, 2014.
- Bianchini & Scarselli (2014) Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.
- Eldan & Shamir (2016) Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In COLT, pp. 907–940, 2016.
- Hanin (2018) Hanin, B. Which neural net architectures give rise to exploding and vanishing gradients? In NeurIPS, 2018.
- Hanin & Nica (2018) Hanin, B. and Nica, M. Products of many large random matrices and gradients in deep neural networks. Preprint arXiv:1812.05994, 2018.
- Hanin & Rolnick (2018) Hanin, B. and Rolnick, D. How to start training: The effect of initialization and architecture. In NeurIPS, 2018.
- Lin et al. (2017) Lin, H. W., Tegmark, M., and Rolnick, D. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
- Mhaskar et al. (2016) Mhaskar, H., Liao, Q., and Poggio, T. Learning functions: when is deep better than shallow. Preprint arXiv:1603.00988, 2016.
- Mhaskar & Poggio (2016) Mhaskar, H. N. and Poggio, T. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
- Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In NeurIPS, pp. 2924–2932, 2014.
- Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In NeurIPS, pp. 5947–5956, 2017.
- Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. In ICLR, 2018.
- Petersen & Voigtlaender (2018) Petersen, P. and Voigtlaender, F. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
- Poggio et al. (2017) Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. Why and when can deep – but not shallow – networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.
- Poole et al. (2016) Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In NeurIPS, pp. 3360–3368, 2016.
- Raghu et al. (2017) Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Dickstein, J. S. On the expressive power of deep neural networks. In ICML, pp. 2847–2854, 2017.
- Rolnick & Tegmark (2018) Rolnick, D. and Tegmark, M. The power of deeper networks for expressing natural functions. In ICLR, 2018.
- Serra et al. (2018) Serra, T., Tjandraatmadja, C., and Ramalingam, S. Bounding and counting linear regions of deep neural networks. In ICML, 2018.
- Shalev-Shwartz et al. (2018) Shalev-Shwartz, S., Shamir, O., and Shammah, S. Failures of gradient-based deep learning. In ICML, 2018.
- Telgarsky (2015) Telgarsky, M. Representation benefits of deep feedforward networks. Preprint arXiv:1509.08101, 2015.
- Telgarsky (2016) Telgarsky, M. Benefits of depth in neural networks. In COLT, 2016.
- Yarotsky (2017) Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
- Yarotsky (2018) Yarotsky, D. Optimal approximation of continuous functions by very deep ReLU networks. In COLT, 2018.
- Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Appendix A Formal Statement of Results for General Piecewise Linear Activations
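In §5, we stated our results in the case of ReLU activation, and now frame these results for a general piecewise linear non-linearity. We fix some notation. Let $\phi : \mathbb R \to \mathbb R$ be a continuous piecewise linear function with $T$ breakpoints $\xi_0 = -\infty < \xi_1 < \cdots < \xi_T < \xi_{T+1} = \infty$. That is, there exist $p_j, q_j \in \mathbb R$ so that
$$\phi(t) = q_j t + p_j \quad \text{for } t \in [\xi_j, \xi_{j+1}], \qquad q_j \ne q_{j+1}. \tag{11}$$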
The analog of Theorem 3 for general is the following.
Let be a continuous piecewise linear function with breakpoints as in (11). Suppose is a fully connected network with input dimension output dimension , random weights and biases satisfying and above, and non-linearity .
Let be the Jacobian of the map
and write for the density of the joint distribution of the biases . We say a neuron is good at if there exists a path of neurons from to the output in the computational graph of so that each neuron along this path is open at (i.e. ).
Then, for any bounded, measurable set and any the average –dimensional volume
of inside is, in the notation of (6),
multiplied by the indicator function of the event that is good at for every
Note that if in the definition (11) of we have that the possible values do not include , then we may ignore the event that are good at in the definition of
With the notation and assumptions of Theorem 6, suppose in addition that the weights and biases are independent. Fix and suppose that for every collection of distinct neurons , the average magnitude of the product of gradients is uniformly bounded:
Then we have the following upper bounds
where is the number of breakpoints in the non-linearity of (see (11)) and
Appendix B Outline of Proof of Theorem 6
The purpose of this section is to give an intuitive explanation of the proof of Theorem 3. We fix a non-linearity with breakpoints (as in (11)) and consider a fully connected network with input dimension , output dimension , and non-linearity For each neuron in , we write
Intuitively, the set is the collection of inputs for which the neuron turns from on to off. In contrast, the set is the collection of inputs for which is open in the sense that there is a path from the input to the output of so that the neurons along this path are not constant in a neighborhood of the input. Thus, is the set of inputs at which neuron switches between its linear regions and at which the output of neuron actually affects the function computed by
We remark here that if in the non-linearity there are no linear pieces at which the slopes on equals (i.e. for all in the definition (11) of ). If, for example, is ReLU, then need not be empty.
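The sets described above can be probed numerically. The following is a minimal sketch for the ReLU case only: it walks along a line segment in input space and counts the grid steps at which some hidden neuron switches between on and off. The initialization (He-normal weights, Gaussian biases with standard deviation 0.1), the function names, and the grid resolution are all assumptions made for illustration. The count is a proxy: it also registers switches of neurons that do not affect the output, and a finite grid can merge switches that fall within the same step.

```python
import numpy as np

def init_relu_net(widths, rng):
    """Random fully connected net. He-normal weights and N(0, 0.1^2) biases are assumed."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        b = rng.normal(0.0, 0.1, size=n_out)
        params.append((W, b))
    return params

def activation_patterns(params, xs):
    """Boolean array (num_points, num_hidden_neurons): True where a neuron is on."""
    h = xs
    patterns = []
    for W, b in params[:-1]:  # hidden layers only; the last layer is linear
        z = h @ W.T + b
        patterns.append(z > 0)
        h = np.maximum(z, 0.0)
    return np.concatenate(patterns, axis=1)

def count_pattern_changes(params, x0, x1, num_points=20001):
    """Count grid steps on the segment [x0, x1] where the on/off pattern changes."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]
    xs = (1.0 - t) * x0 + t * x1
    pats = activation_patterns(params, xs)
    changed = np.any(pats[1:] != pats[:-1], axis=1)
    return int(changed.sum())

rng = np.random.default_rng(0)
net = init_relu_net([2, 16, 16, 1], rng)
n_changes = count_pattern_changes(net, np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
print("pattern changes along the segment:", n_changes)
```

Repeating the count over many random draws and widths gives an empirical estimate to compare against the number of hidden neurons.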
The overall proof of Theorem 3 can be divided into several steps. The first gives the following representation of
Under Assumptions and of Theorem 3, we have, with probability
The precise proof of Proposition 9 can be found in §C.1 below. The basic idea is that if for all near a fixed input none of the pre-activations cross the boundary of a linear region for , then Thus, Moreover, if a neuron satisfies for some but there are no open paths from to the output of for inputs near , then is dead at and hence does not influence at Thus, we expect the more refined inclusion . Finally, if for some then unless the contribution from other neurons to for near exactly cancels the discontinuity in This happens with probability .
The next step in proving Theorem 3 is to identify the portions of of each dimension. To do this, we write for any distinct neurons ,
The set is, intuitively, the collection of inputs at which switches between linear regions for and at which the output of is affected by the post-activations of these neurons. Proposition 9 shows that we may represent as a disjoint union
In words, is the collection of inputs in at which exactly neurons turn from on to off. The following Proposition shows that is precisely the “-dimensional piece of ” (see (5)).
Fix and distinct neurons in Then, with probability for every there exists a neighborhood in which coincides with a dimensional hyperplane.
We prove Proposition 10 in §C.2. The idea is that each is piecewise linear and, with probability , at every point at which exactly the neurons contribute to , its co-dimension is the number of linear conditions needed to define it. Observe that with probability , the bias vector for any collection of distinct neurons is a regular value for . Hence,
Proposition 10 thus implies that, with probability
The final step in the proof of Theorem 3 is therefore to prove the following result.
Let be distinct neurons in Then, for any bounded, measurable ,
where is defined as in (13).
from (13). The probability of an infinitesimal neighborhood of belonging to a -dimensional piece of is therefore the probability
Appendix C Proof of Theorem 3
C.1 Proof of Proposition 9
Recall that the non-linearity is continuous and piecewise linear with breakpoints so that, with , we have
with For each write
Intuitively, are the neurons that, at the input are open (i.e. contribute to the gradient of the output ) but do not change their contribution in a neighborhood of , are the neurons that are closed, and are the neurons that, at , produce a discontinuity in the derivative of Thus, for example, if then
We begin by proving that by checking the contrapositive
Fix . Note that are locally constant in the sense that there exists so that for all with , we have
Moreover, observe that if in the definition (11) of none of the slopes equal , then for every . To prove (19), consider any path from the input to the output in the computational graph of Such a path consists of neurons, one in each layer:
To each path we may associate a sequence of weights:
We will also define
For instance, if , then
and in general only one term in the definition of is non-zero for each We may write
Note that if , then for any path through a neuron , we have
This is an open condition in light of (20), and hence for all in a neighborhood of and for any path through a neuron we also have that
Thus, since the summand in (21) vanishes identically if , we find that for in a neighborhood of any we may write
But, again by (20), for any fixed , all in a neighborhood of and each we have as well. Thus, in particular,
Thus, for sufficiently close to we have for every path in the sum (22) that
Therefore, the partial derivatives are independent of in a neighborhood of and hence continuous at . This proves (19). Let us now prove the reverse inclusion:
Note that, with probability we have
for any pair of distinct neurons Note also that since is continuous and piecewise linear, the set is closed. Thus, it is enough to show the slightly weaker inclusion
since the closure of equals Fix a neuron and suppose . By definition, we have that for every neuron either
This has two consequences. First, by (20), the map is linear in a neighborhood of Second, in a neighborhood of the set coincides with . Hence, combining these facts, near the set coincides with the hyperplane
We may take two sequences of inputs on opposite sides of this hyperplane so that
where the index is the same as the one that defines the hyperplane (25). Further, since has co-dimension (it is contained in the piecewise linear co-dimension set , for example), we may also assume that Consider any path from the input to the output of the computational graph of passing through (so that ). By construction, for every , we have
and hence, after passing to a subsequence, we may assume that the symmetric difference
of the paths that contribute to the representation (21) for is fixed and non-empty (the latter since it always contains ). For any we may write, for each
Substituting into this expression , we find that there exists a non-empty collection of paths from the input to the output of so that
Note that the expression above is a polynomial in the weights of . Note also that, by construction, this polynomial is not identically zero due to the condition (26). There are only finitely many such polynomials since both and range over a finite alphabet. For each such non-zero polynomial, the set of weights at which it vanishes has co-dimension . Hence, with probability the difference is non-zero. This shows that the partial derivatives are not continuous at and hence that
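The sum-over-paths representation of the partial derivatives used in this proof can be checked numerically on a small network. The sketch below is restricted to ReLU, where each neuron contributes a slope of 0 or 1, and compares the path sum against the usual chain-rule product of weight matrices. The layer sizes, the Gaussian weights, and all function names are illustrative assumptions.

```python
import itertools
import numpy as np

def forward_patterns(weights, biases, x):
    """On/off state of every hidden neuron at input x (ReLU is assumed)."""
    h, patterns = x, []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b
        patterns.append(z > 0)
        h = np.maximum(z, 0.0)
    return patterns

def jacobian_chain_rule(weights, patterns):
    """Input-output Jacobian as a product of weight matrices masked by the patterns."""
    jac = weights[0]
    for W, on in zip(weights[1:], patterns):
        jac = W @ (on[:, None] * jac)
    return jac

def jacobian_path_sum(weights, patterns):
    """Same Jacobian as a sum over paths with one neuron per hidden layer."""
    n_in, n_out = weights[0].shape[1], weights[-1].shape[0]
    hidden = [W.shape[0] for W in weights[:-1]]
    jac = np.zeros((n_out, n_in))
    for path in itertools.product(*[range(w) for w in hidden]):
        if not all(on[i] for on, i in zip(patterns, path)):
            continue  # a closed neuron on the path makes the product vanish
        for j in range(n_in):
            for o in range(n_out):
                prod = weights[0][path[0], j]
                for l in range(1, len(path)):
                    prod *= weights[l][path[l], path[l - 1]]
                jac[o, j] += prod * weights[-1][o, path[-1]]
    return jac

rng = np.random.default_rng(0)
sizes = [3, 4, 3, 2]  # input, two hidden layers, output (hypothetical)
weights = [rng.normal(size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=o) for o in sizes[1:]]
x = rng.normal(size=sizes[0])
patterns = forward_patterns(weights, biases, x)
print(np.allclose(jacobian_chain_rule(weights, patterns),
                  jacobian_path_sum(weights, patterns)))
```

The path sum is exponentially expensive in depth and is meant only to illustrate the representation, not as a practical way to compute gradients.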
C.2 Proof of Proposition 10
Fix distinct neurons and suppose but not in for any After relabeling, we may assume that they are ordered by layer index:
Since , we also have that for any Thus, there exists a neighborhood of so that for every Thus, there exists a neighborhood of on which is linear.
Hence, as explained near (25) above, is a hyperplane near We now restrict our inputs to this hyperplane and repeat this reasoning to see that, near the set is a hyperplane inside and hence, near , is the intersection of two hyperplanes in . Continuing in this way shows that in a neighborhood of the set is equal to the intersection of hyperplanes in Thus, is precisely the intersection of hyperplanes in a neighborhood of each of its points.
C.3 Proof of Proposition 11
Let be distinct neurons in and fix a compact set . We seek to compute the mean of , which we may rewrite as
where we’ve set
Note that the map is Lipschitz, and recall the co-area formula, which says that if and with is Lipschitz, then
where is the Jacobian of and
We assumed that the biases have a joint conditional density
given all other weights and biases. The mean of the term in (28) corresponding to a fixed over the conditional distribution of is therefore
where we’ve abbreviated as well as . This can be rewritten as
Taking the average over the remaining weights and biases, we may commute the expectation with the integral since the integrand is non-negative. This completes the proof of Proposition 11.
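For reference, the co-area formula invoked in this proof can be written in its standard form as follows. The symbols $\psi$, $g$, $n$, $m$ are notation chosen here and may differ from the paper's own.

```latex
% Co-area formula (standard statement; notation assumed).
% Hypotheses: \psi \in L^1(\mathbb{R}^n), and g : \mathbb{R}^n \to \mathbb{R}^m
% is Lipschitz with m \le n.
\int_{\mathbb{R}^m} \int_{g^{-1}(t)} \psi(x)\, d\mathrm{vol}_{n-m}(x)\, dt
  \;=\; \int_{\mathbb{R}^n} \psi(x)\, \lVert Jg(x) \rVert \, dx,
\qquad
\lVert Jg(x) \rVert \;=\; \sqrt{\det\!\big( Jg(x)\, Jg(x)^{T} \big)} .
```

Here $Jg(x)$ is the $m \times n$ Jacobian of $g$, which exists for almost every $x$ because $g$ is Lipschitz.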
Appendix D Proof of Corollary 7
where, as in (13), is
times the indicator function of the event that is good at for every When the weights and biases of are independent, we may write as
where for any
is the associated Gram matrix. The Gram identity says that equals
which is the -dimensional volume of the parallelepiped in spanned by We thus have
Therefore, by Theorem 1 of Hanin & Nica (2018), there exist so that
This completes the proof of the upper bound in (15). To prove the lower bound in (15) we must argue in a different way. Namely, we will induct on and use the following facts to prove the base case :
At initialization, for each fixed input the random variables are independent Bernoulli random variables with parameter This fact is proved in Proposition 2 of Hanin & Nica (2018). In particular, the event , which occurs when there exists a layer in which for every neuron, is independent of and satisfies
At initialization, for each fixed input , we have
where . This is Equation (11) in the proof of Theorem 5 from Hanin & Rolnick (2018).
At initialization, for every neuron and each input we have
This follows easily from Theorem 1 of Hanin (2018).
At initialization, for each and every
plus , where is the width of the hidden layer and the implied constant depends only on the moment of the measure according to which weights are distributed. This estimate follows immediately by combining Corollary 26 and Proposition 28 in Hanin & Nica (2018).
Next, we bound the integrand. Fix and a parameter to be chosen later. The integrand is bounded below by
which is bounded below by
Using Cauchy-Schwarz, the term is bounded above by
Next, using Jensen’s inequality twice, we write
where in the last inequality we applied (34). Putting this all together, we find that there exists so that
In particular, we may take
for sufficiently large. This completes the proof of the lower bound in (15) when . To complete the proof of Corollary 7, suppose we have proved the lower bound in (15) for all networks and all collections of distinct neurons. We may assume after relabeling that the neurons are ordered by layer index:
With probability the set is piecewise linear, co-dimension with finitely many pieces, which we denote by . We may therefore rewrite as
We now define a new neural network , obtained by restricting to The input dimension for equals and the weights and biases of satisfy all the assumptions of Corollary 7. We can now apply our inductive hypothesis to the neurons in and to the set This gives
Summing this lower bound over yields
Applying the inductive hypothesis once more completes the proof.
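The Gram identity used in this proof, which equates the square root of the determinant of a Gram matrix with the volume of the parallelepiped spanned by the underlying vectors, is easy to verify numerically. The sketch below compares it with an independent volume computation from a QR factorization. The function names and the random test vectors are assumptions for illustration.

```python
import numpy as np

def gram_volume(vectors):
    """k-dimensional volume spanned by the rows of `vectors` (shape k x n), via the Gram matrix."""
    gram = vectors @ vectors.T
    # Clamp at zero to guard against tiny negative determinants from round-off.
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))

def qr_volume(vectors):
    """Same volume as the product of |diagonal entries| of R in a QR factorization."""
    r = np.linalg.qr(vectors.T, mode="r")
    return float(np.abs(np.prod(np.diag(r))))

rng = np.random.default_rng(0)
grads = rng.normal(size=(3, 5))  # three hypothetical gradient vectors in R^5
print(np.isclose(gram_volume(grads), qr_volume(grads)))
```

For two orthogonal vectors of lengths 2 and 3 the Gram matrix is diagonal with entries 4 and 9, so both functions return the area 6 of the rectangle they span.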
Appendix E Proof of Corollary 8
We will need the following observation.
Fix a positive integer , and let be a compact continuous piecewise linear submanifold with finitely many pieces. Define and let be the union of the interiors of all -dimensional pieces of . Denote by the tubular neighborhood of any We have
Define to be the maximal dimension of the linear pieces in Let Suppose for all Then the intersection of the ball of radius around with is a ball inside . Using the convexity of this ball, there exists a point in so that the vector is parallel to the normal vector to at . Hence, belongs to the normal -ball bundle (i.e. the union of the fiber-wise -balls in the normal bundle to ). Therefore, we have
where we abbreviated Using that
and repeating this argument times completes the proof. ∎