Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently

Abstract

Driven by the empirical success and wide use of deep neural networks,understanding the generalization performance of overparameterized models hasbecome an increasingly popular question. To this end, there has beensubstantial effort to characterize the implicit bias of the optimizationalgorithms used, such as gradient descent (GD), and the structural propertiesof their preferred solutions. This paper answers an open question in thisliterature: For the classification setting, what solution does mirror descent(MD) converge to? Specifically, motivated by its efficient implementation, weconsider the family of mirror descent algorithms with potential function chosenas the $p$-th power of the $\ell_p$-norm, which is an important generalizationof GD. We call this algorithm $p$-$\textsf{GD}$. For this family, wecharacterize the solutions it obtains and show that it converges in directionto a generalized maximum-margin solution with respect to the $\ell_p$-norm forlinearly separable classification. While the MD update rule is in generalexpensive to compute and perhaps not suitable for deep learning,$p$-$\textsf{GD}$ is fully parallelizable in the same manner as SGD and can beused to train deep neural networks with virtually no additional computationaloverhead. Using comprehensive experiments with both linear and deep neuralnetwork models, we demonstrate that $p$-$\textsf{GD}$ can noticeably affect thestructure and the generalization performance of the learned models.

Quick Read (beta)

loading the full paper ...