Deep Neural Networks with Multi-Branch Architectures Are Less Non-Convex

Abstract

Several recently proposed architectures of neural networks such as ResNeXt,Inception, Xception, SqueezeNet and Wide ResNet are based on the designing ideaof having multiple branches and have demonstrated improved performance in manyapplications. We show that one cause for such success is due to the fact thatthe multi-branch architecture is less non-convex in terms of duality gap. Theduality gap measures the degree of intrinsic non-convexity of an optimizationproblem: smaller gap in relative value implies lower degree of intrinsicnon-convexity. The challenge is to quantitatively measure the duality gap ofhighly non-convex problems such as deep neural networks. In this work, weprovide strong guarantees of this quantity for two classes of networkarchitectures. For the neural networks with arbitrary activation functions,multi-branch architecture and a variant of hinge loss, we show that the dualitygap of both population and empirical risks shrinks to zero as the number ofbranches increases. This result sheds light on better understanding the powerof over-parametrization where increasing the network width tends to make theloss surface less non-convex. For the neural networks with linear activationfunction and $\ell_2$ loss, we show that the duality gap of empirical risk iszero. Our two results work for arbitrary depths and adversarial data, while theanalytical techniques might be of independent interest to non-convexoptimization more broadly. Experiments on both synthetic and real-worlddatasets validate our results.

Quick Read (beta)

loading the full paper ...