Abstract
The success of deep learning is thanks to our ability to solve certain massive non-convex optimization problems with relative ease. Despite non-convex optimization being NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes contain (nearly) a single basin, after accounting for all possible permutation symmetries of hidden units. We introduce three algorithms to permute the units of one model to bring them into alignment with units of a reference model. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100. Additionally, we identify intriguing phenomena relating model width and training time to mode connectivity across a variety of models and datasets. Finally, we discuss shortcomings of a single basin theory, including a counterexample to the linear mode connectivity hypothesis.
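
As a minimal illustration of the permutation symmetry referred to above, the following sketch reorders the hidden units of a small two-layer ReLU network and checks that the output is unchanged. It is not one of the three alignment algorithms; the helper names (`mlp`, `permute_hidden`), the layer sizes, and the random weights are all assumptions made for illustration only.

```python
import random

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU network, W2 @ relu(W1 @ x + b1) + b2, on plain lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1, entries of b1, and columns of W2."""
    W1p = [W1[i] for i in perm]
    b1p = [b1[i] for i in perm]
    W2p = [[row[i] for i in perm] for row in W2]
    return W1p, b1p, W2p

# Hypothetical sizes and random weights, chosen only for the demonstration.
rng = random.Random(0)
d_in, d_hidden, d_out = 3, 5, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
b2 = [rng.gauss(0, 1) for _ in range(d_out)]

# Pick a random ordering of the hidden units and apply it to the weights.
perm = list(range(d_hidden))
rng.shuffle(perm)
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm)

# The permuted network computes the same function, up to rounding error.
x = [rng.gauss(0, 1) for _ in range(d_in)]
y = mlp(x, W1, b1, W2, b2)
yp = mlp(x, W1p, b1p, W2p, b2)
max_diff = max(abs(a - b) for a, b in zip(y, yp))
print(max_diff)
```

The two weight sets differ as points in parameter space but define the same function, which is why a suitable permutation can move one model toward a reference model without changing its predictions.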